[Document name] Specification
[Title of the Invention] 

A computer system for generating a data structure for information retrieval, a method thereof, a computer executable program for generating a data structure for information retrieval, a computer readable medium storing the program for generating a data structure for information retrieval, an information retrieval system, and a graphical user interface system
[Claims] 

Claim 1 A computer system for generating data structures for information retrieval of documents stored in a database, said documents being stored as document-keyword vectors generated from a predetermined keyword list, and said document-keyword vectors forming nodes of a hierarchical structure imposed upon said documents, said computer system comprising:
a neighborhood patch generation part for generating groups of nodes having similarities as determined using a search structure, said neighborhood patch generation part including a part for generating a hierarchical structure upon said document-keyword vectors and a patch defining part for creating patch relationships among said nodes with respect to a metric distance between nodes; and
a cluster estimation part for generating cluster data of said document-keyword vectors using said similarities of patches.

Claim 2 The computer system of claim 1, wherein said computer system comprises a confidence determination part for computing inter-patch confidence values between said patches and intra-patch confidence values, and said cluster estimation part selects said patches depending on said inter-patch confidence values to represent clusters of said document-keyword vectors.

Claim 3 The computer system of claim 1, wherein said cluster estimation part estimates sizes of said clusters depending on said intra-patch confidence values.

Claim 4 A method for generating data structures for information retrieval of documents stored in a database, said documents being stored as document-keyword vectors generated from a predetermined keyword list, and said document-keyword vectors forming nodes of a hierarchical structure imposed upon said documents, said method comprising the steps of:
generating a hierarchical structure upon said document-keyword vectors and storing hierarchy data in an adequate storage area;

generating neighborhood patches of nodes having similarities as determined using levels of the hierarchical structure, and storing said patches in an adequate storage area;

invoking said hierarchy data and said patches to compute inter-patch confidence values between said patches and intra-patch confidence values, and storing said values as corresponding lists in an adequate storage area; and

selecting said patches depending on said inter-patch confidence values and said intra-patch confidence values to represent clusters of said document-keyword vectors.

Claim 5 The method according to claim 4 further comprising the step of e 
stimating sizes of said clusters depending on said intra-patch confidenc 
e values. 

Claim 6 A program for making a computer system execute a method for gene 
rating data structures for information retrieval of documents stored in 
a database, said documents being stored as document-keyword vectors gene
rated from a predetermined keyword list, and said document-keyword vecto 
rs forming nodes of a hierarchical structure introduced into said docume 
nts, said program making said computer system execute the steps of: 
generating a hierarchical structure upon said document-keyword vectors a 
nd storing hierarchy data in an adequate storage area; 
generating neighborhood patches consisting of nodes having similarities 
as determined using levels of the hierarchical structure, and storing sa 
id patches in an adequate storage area; 

invoking said hierarchy data and said patches to compute inter-patch con 
fidence values between said patches and intra-patch confidence values, a 
nd storing said values as corresponding lists in an adequate storage are 
a; and 

selecting said patches depending on said inter-patch confidence values a 
nd said intra-patch confidence values to represent clusters of said docu 
ment-keyword vectors. 

Claim 7 The method according to claim 6, further comprising the step of 
estimating sizes of said clusters depending on said intra-patch confiden 
ce values. 

Claim 8 A computer readable medium storing a program for making a computer system execute a method for generating data structures for information retrieval of documents stored in a database, said documents being stored as document-keyword vectors generated from a predetermined keyword list, and said document-keyword vectors forming nodes of a hierarchical structure imposed upon said documents, said program making said computer system execute the steps of:

generating a hierarchical structure upon said document-keyword vectors and storing hierarchy data in an adequate storage area;

generating neighborhood patches consisting of nodes having similarities as determined using levels of the hierarchical structure, and storing said patches in an adequate storage area;

invoking said hierarchy data and said patches to compute inter-patch con 
fidence values between said patches and intra-patch confidence values, a 
nd storing said values as corresponding lists in an adequate storage are 
a; and 

selecting said patches depending on said inter-patch confidence values a 
nd said intra-patch confidence values to represent clusters of said docu 
ment-keyword vectors. 

Claim 9 The method according to claim 8, further comprising the step of 
estimating sizes of said clusters depending on said intra-patch confiden 
ce values. 

Claim 10 An information retrieval system for documents stored in a database, said documents being stored as document-keyword vectors generated from a predetermined keyword list, and said document-keyword vectors forming nodes of a hierarchical structure imposed upon said documents, said system comprising:

a neighborhood patch generation part for generating groups of nodes having similarities as determined using a hierarchical structure, said patch generation part including a part for generating a hierarchical structure upon said document-keyword vectors and a patch defining part for creating patch relationships among said nodes with respect to a metric distance between nodes;

a cluster estimation part for generating cluster data of said document-keyword vectors using said similarities of patches; and

a graphical user interface part for presenting said estimated cluster data on a display means.

Claim 11 The computer system of claim 10, wherein said information retri 
eval system comprises a confidence determination part for computing inte 
r-patch confidence values between said patches and intra-patch confidenc 
e values, and said cluster estimation part selects said patches dependin 
g on said inter-patch confidence values to represent clusters of said do 
cument-keyword vectors. 

Claim 12 The system of claim 10, wherein said cluster estimation part es 
timates sizes of said clusters depending on said intra-patch confidence 
values. 

Claim 13 The system of claim 10, wherein said system further comprises a user query receiving part for receiving said query and extracting data for information retrieval to generate a query vector, and an information retrieval part for computing similarities between said document-keyword vectors and said query vector to select said document-keyword vectors.

Claim 14 The system of claim 10, wherein said clusters are estimated usi 
ng said retrieved document-keyword vectors with respect to said user inp 
ut query. 

Claim 15 A graphical user interface system for graphically presenting estimated clusters on a display device in response to a user input query, said graphical user interface system comprising:
a database for storing documents;

a computer for generating document-keyword vectors for said documents stored in said database and for estimating clusters of documents in response to said user input query; and

a display for displaying on screen said estimated clusters together with confidence relations between said clusters and hierarchical information pertaining to cluster size.

Claim 16 The graphical user interface system of claim 15, wherein said computer comprises:

a neighborhood patch generation part for generating groups of nodes having similarities as determined using a search structure, said neighborhood patch generation part including a part for generating a hierarchical structure upon said document-keyword vectors and a patch defining part for creating patch relationships among said nodes with respect to a metric distance between nodes; and
a cluster estimation part for generating cluster data of said document-keyword vectors using said similarities of patches.

Claim 17 The graphical user interface system of claim 15, wherein said computer comprises a confidence determination part for computing inter-patch confidence values between said patches and intra-patch confidence values, and said cluster estimation part selects said patches depending on said inter-patch confidence values to represent clusters of said document-keyword vectors and said cluster estimation part estimates sizes of said clusters depending on said intra-patch confidence values.

[Detailed Explanation of Invention] 
[Field of Invention] 

The present invention relates to information retrieval from a large data 
base, and more particularly relates to a computer system for generating 
a data structure for information retrieval, a method thereof, a computer
executable program for generating a data structure for information retr 
ieval, a computer readable medium storing the program for generating a d 
ata structure for information retrieval, an information retrieval system 
, and a graphical user interface system. 

[Background of the Art] 

Recently, information processing systems are increasingly expected to ha 
ndle large amounts of data such as, for example, news data, client infor 
mation, patent information, and stock market data. Users of such databas 
es find it increasingly difficult to search for desired information quic 
kly and effectively with sufficient accuracy. Therefore, timely, accurat 
e, and inexpensive detection of documents from large databases may provi 
de very valuable information for many types of businesses. In addition, 
sometimes users wish to obtain further information related to data retri 
eved, such as cluster information in the database, and the interrelation 
ships among such clusters. 

Typical methods for detecting clusters rely upon a measure of similarity between data elements; several such methods, based on similarity search, have been proposed so far and are summarized below.

Similarity search (also known as proximity search) is a search in which items of a database are sought according to how well they match a given query element. Similarity (or rather, dissimilarity) is typically modeled using some real- or integer-valued distance 'metric' dist: that is, a function satisfying

(1) dist(p, q) ≥ 0 for all p, q (non-negativity);

(2) dist(p, q) = dist(q, p) for all p, q (symmetry);

(3) dist(p, q) = 0 if and only if p = q;

(4) dist(p, q) + dist(q, r) ≥ dist(p, r) for all p, q, r (triangle inequality).

Any set of objects for which such a distance function exists is called a metric space. A data structure that allows a reduction in the number of distance evaluations at query time is known as an index. Many methods for similarity queries have been proposed. Similarity queries on metric spaces are of two general types, as stated below:

(A) k-nearest-neighbor query: given a query element q and a positive int 
eger k, report the k closest database elements to q. 

(B) range query: given a query element q and a distance r, report every database item p such that dist(p, q) ≤ r.
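The two query types can be illustrated with a brute-force sketch. This is only an assumed baseline for illustration: Euclidean distance stands in for the metric dist, and the function names knn_query and range_query are hypothetical. It evaluates the distance to every database element, which is exactly the cost that an index is meant to avoid.

```python
import math

def dist(p, q):
    """Euclidean distance, used here as one example of a metric."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_query(database, q, k):
    """(A) Return the k database elements closest to q, by exhaustive scan."""
    return sorted(database, key=lambda p: dist(p, q))[:k]

def range_query(database, q, r):
    """(B) Return every database element p with dist(p, q) <= r."""
    return [p for p in database if dist(p, q) <= r]

# Hypothetical toy database of two-dimensional points.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
nearest = knn_query(points, (0.0, 0.0), 2)
within = range_query(points, (0.0, 0.0), 2.0)
```

Both functions take time linear in the size of the database for every query.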

For large databases, it is too expensive to perform similarity queries by explicitly computing the distances from the query element to every database element. Precomputing and storing all distances among database elements is also too expensive, as this would require time and space proportional to the square of the number of database elements (that is, quadratic time and space). A more practical goal is to construct a search structure that can handle queries in sub-linear time using sub-quadratic storage and preprocessing time.

A. Review of Vector Space Models 

Current information retrieval methods often use vector space modeling to represent the documents of databases. In such vector space models, each document in the database under consideration is associated with a vector, each coordinate of which represents a keyword or attribute of the document; details of the vector space models are provided elsewhere (Gerard Salton, The SMART Retrieval System - Experiments in Automatic Document Processing, Prentice-Hall, Englewood Cliffs, NJ, USA, 1971).
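As a minimal sketch of such a representation, the following builds a document-keyword vector from a predetermined keyword list. The weighting by raw occurrence counts, the whitespace tokenization, and the name document_keyword_vector are assumptions made for illustration; the text above does not prescribe a particular weighting scheme.

```python
def document_keyword_vector(text, keyword_list):
    """Count how often each keyword of a predetermined list occurs in text."""
    tokens = text.lower().split()
    return [tokens.count(keyword) for keyword in keyword_list]

# Hypothetical keyword list and document.
keywords = ["cluster", "query", "database"]
doc = "A query against the database returns one cluster per query"
vector = document_keyword_vector(doc, keywords)
```

Each coordinate of the resulting vector corresponds to one keyword, so all documents share the same dimension regardless of their length.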
B. Brief Survey of Similarity Search Structures 

A great variety of structures have been proposed over the past thirty years for handling similarity queries. The majority of these are spatial indices, which require that the object set be modeled as a vector of d real-valued attributes. Others are 'metric' indices, which make no assumptions on the nature of the database elements other than the existence of a distance metric, and are therefore more widely applicable than spatial search structures. For recent surveys of search structures for multi-dimensional vector spaces and metric spaces, see Gaede et al. (Volker Gaede and Oliver Gunther, Multidimensional Access Methods, ACM Computing Surveys, 30, 2, 1998, pp. 170-231.), and Chavez et al. (Edgar Chavez, Gonzalo Navarro, Ricardo Baeza-Yates and Jose L. Marroquin, Searching in metric spaces, ACM Computing Surveys 33, 3, 2001, pp. 273-321.).

The practicality of similarity search, whether it be on metric data or v 
ector data, is limited by an effect often referred to as the 'curse of 
dimensionality'. Recent evidence suggests that for the general problem o 
f computing nearest-neighbor or range queries on high-dimensional data s 
ets, exact techniques are unlikely to improve substantially over a seque 
ntial search of the entire database, unless the underlying distribution 
of the data set has special properties, such as a low fractal dimension, 
low intrinsic dimension, or other properties of the distribution. 

For more information regarding data dimension and the curse of dimensionality, see (for example) Chavez et al. (op. cit.), Pagel et al. (Bernd-Uwe Pagel, Flip Korn and Christos Faloutsos, Deflating the dimensionality curse using multiple fractal dimensions, Proc. 16th International Conference on Data Engineering (ICDE 2000), San Diego, USA, IEEE CS Press, 2000, pp. 589-598.), Pestov (Vladimir Pestov, On the geometry of similarity search: dimensionality curse and concentration of measure, Information Processing Letters, 73, 2000, pp. 47-51.), and Weber et al. (Roger Weber, Hans-J. Schek and Stephen Blott, A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces, Proc. 24th VLDB Conference, New York, USA, 1998, pp. 194-205).

C. Brief Survey of Approximate Similarity Searching 

In an attempt to circumvent the curse of dimensionality, researchers have considered sacrificing some of the accuracy of similarity queries in the hope of obtaining a speed-up in computation. Details of these techniques are provided elsewhere, for example, by Indyk et al. (P. Indyk and R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, Proc. 30th ACM Symposium on Theory of Computing, Dallas, 1998, pp. 604-613.), and Ferhatosmanoglu et al. (Hakan Ferhatosmanoglu, Ertem Tuncel, Divyakant Agrawal and Amr El Abbadi, Approximate nearest neighbor searching in multimedia databases, Proc. 17th International Conference on Data Engineering (ICDE), Heidelberg, Germany, IEEE CS Press, 2001, pp. 503-514.); for metric spaces, by Ciaccia et al. (Paolo Ciaccia and Marco Patella, PAC nearest neighbor queries: approximate and controlled search in high-dimensional and metric spaces, Proc. 16th International Conference on Data Engineering (ICDE 2000), San Diego, USA, 2000, pp. 244-255; Paolo Ciaccia, Marco Patella and Pavel Zezula, M-tree: an efficient access method for similarity search in metric spaces, Proc. 23rd VLDB Conference, Athens, Greece, 1997, pp. 426-435.) and Zezula et al. (Pavel Zezula, Pasquale Savino, Giuseppe Amato and Fausto Rabitti, Approximate similarity retrieval with M-trees, The VLDB Journal, 7, 1998, pp. 275-293.). However, these methods all suffer from deficiencies that limit their usefulness in practice. Some make unrealistic assumptions concerning the distribution of the data; others cannot effectively manage the trade-off between accuracy and speed.

D. Spatial Approximation Sample Hierarchy (SASH) 

An approximate similarity search structure for large multi-dimensional data sets that allows significantly better control over the accuracy-speed tradeoff is the spatial approximation sample hierarchy (SASH), described in Houle (Michael E. Houle, SASH: a spatial approximation sample hierarchy for similarity search, IBM Tokyo Research Laboratory Research Report RT-0446, 18 pages, February 18, 2002) and Houle, Kobayashi and Aono (Japanese Patent Application No. 2002-037842). The SASH requires a similarity function satisfying the conditions of a distance metric, but otherwise makes no assumptions regarding the nature of the data. Each data element is given a unique location within the structure, and each connection between two elements indicates that they are closely related. Each level of the hierarchy consists of a random sample of the elements, with the sample size at each level roughly double that of the level immediately above it. The structure is organized in such a way that the elements located closest to a given element v are those that are most similar to v. In particular, the node corresponding to v is connected to a set of its near neighbors from the level above, and also to a set of items from the level below that choose v as a near neighbor.
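The level-size property alone can be sketched as follows. This is not the SASH construction itself: it only assigns each element to exactly one level by random sampling, so that each level is about twice the size of the one above it, and it omits the near-neighbor connections between levels entirely. The function name sample_levels and the seed parameter are hypothetical.

```python
import random

def sample_levels(elements, seed=0):
    """Assign each element to one level; levels roughly double in size downward."""
    rng = random.Random(seed)
    remaining = list(elements)
    rng.shuffle(remaining)                 # random order gives random samples
    levels = []
    while len(remaining) > 1:
        half = len(remaining) // 2
        levels.append(remaining[half:])    # larger half becomes the lowest open level
        remaining = remaining[:half]       # smaller half is split again
    levels.append(remaining)               # top level (a single element if any exist)
    levels.reverse()                       # levels[0] is the top of the hierarchy
    return levels

levels = sample_levels(range(15))
sizes = [len(level) for level in levels]
```

With 15 elements the level sizes are 1, 2, 4 and 8 from top to bottom, whichever elements the random shuffle happens to place in each level.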

E. Review of Clustering Techniques 

The term clustering refers to any grouping of unlabeled data according to similarity criteria. Traditional clustering methods can generally be classified as being either partitional or hierarchical. Hierarchical techniques produce a tree structure indicating inclusion relationships among groups of data (clusters), with the root of the tree corresponding to the entire data set. Partitional techniques typically rely on the global minimization of classification error in distributing data points among a fixed number of disjoint clusters. In their recent survey, Jain, Murty and Flynn (A. K. Jain, M. N. Murty and P. J. Flynn, Data clustering: a review, ACM Computing Surveys 31, 3, 1999, pp. 264-323.) argue that partitional clustering schemes tend to be less expensive than hierarchical ones, but are also considerably less flexible. Despite being simple, fast (linear observed time complexity), and easy to implement, even the well-known partitional algorithm K-means and its variants generally do not perform well on large data sets. Partitional algorithms favor the generation of isotropic (rounded) clusters, but are not well-suited for finding irregularly-shaped ones.

F. Hierarchical agglomerative clustering

In hierarchical agglomerative clustering, each data point is initially considered to constitute a separate cluster. Pairs of clusters are then successively merged until all data points lie in a single cluster. The larger cluster produced at each step contains the elements of both merged subclusters; it is this inclusion relationship that gives rise to the cluster hierarchy. The choice of which pairs to merge is made so as to minimize some inter-cluster distance criterion.
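The merging procedure can be sketched as below. The inter-cluster criterion used here is the single-link distance (the smallest distance between any pair of elements drawn from the two clusters); that choice is an assumption made for illustration, since the description above leaves the criterion open. The function name single_link_agglomerate is hypothetical.

```python
def single_link_agglomerate(points, dist):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[p] for p in points]       # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link criterion: closest pair across the two clusters.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((d, merged))          # record the inclusion relationship
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

history = single_link_agglomerate([1.0, 2.0, 6.0, 7.5], lambda a, b: abs(a - b))
```

The sequence of merged clusters in history is the cluster hierarchy: the last entry always contains the entire data set.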

G. Shared-neighbor methods 

One of the criticisms of simple distance-based agglomerative clustering methods is that they are biased towards forming clusters in regions of higher density. Well-associated groups of data in regions of low density risk not being discovered at all, if too many pairwise distances fall below the merge threshold. More sophisticated (and expensive) distance measures for agglomerative clustering have been proposed that take into account the neighborhoods of the data elements. Jarvis et al. (R. A. Jarvis and E. A. Patrick, Clustering using a similarity measure based on shared nearest neighbors, IEEE Transactions on Computers C-22, 11, Nov. 1973, pp. 1025-1034.) defined a merge criterion in terms of an arbitrary similarity measure dist and fixed integer parameters k > r > 0, in which two data elements find themselves in the same cluster if they share at least a certain number of nearest neighbors. The decision as to whether to merge clusters thus does not depend on the local density of the data set, but rather on whether there exists a pair of elements, one drawn from each, that share a neighborhood in a substantial way.
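A simplified sketch of a shared-neighbor merge criterion follows. It assumes that k is the neighborhood size and r the minimum number of shared neighbors, that an element is not counted among its own neighbors, and that a union-find structure records the merges; these details, and the function names, are illustrative assumptions rather than a reproduction of the published algorithm.

```python
def k_nearest(points, i, k, dist):
    """Indices of the k nearest neighbors of points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    return set(sorted(others, key=lambda j: dist(points[i], points[j]))[:k])

def shared_neighbor_clusters(points, k, r, dist):
    """Put two points in the same cluster if they share at least r of k neighbors."""
    neighbors = [k_nearest(points, i, k, dist) for i in range(len(points))]
    parent = list(range(len(points)))       # union-find forest over point indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if len(neighbors[i] & neighbors[j]) >= r:
                parent[find(j)] = find(i)   # merge the two clusters

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(points[i])
    return sorted(groups.values())

data = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
clusters = shared_neighbor_clusters(data, k=3, r=2, dist=lambda a, b: abs(a - b))
```

Because the criterion counts shared neighbors instead of comparing distances against a threshold, rescaling all distances in one group leaves its cluster membership unchanged.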

Jarvis and Patrick's method (op. cit.) is agglomerative, and resembles the single-link method in that it tends to produce irregular clusters via chains of association. More recent variants have been proposed in an attempt to vary the qualities of the clusters produced: for example, by Guha et al. (S. Guha, R. Rastogi and K. Shim, ROCK: a robust clustering algorithm for categorical attributes, Information Systems 25, 5, 2000, pp. 345-366.); by Ertoz et al. (Levent Ertoz, Michael Steinbach and Vipin Kumar, Finding topics in collections of documents: a shared nearest neighbor approach, University of Minnesota Army HPC Research Center Preprint 2001-040, 8 pages, 2001.); by Ertoz et al. (Levent Ertoz, Michael Steinbach and Vipin Kumar, A new shared nearest neighbor clustering algorithm and its applications, Proc. Workshop on Clustering High Dimensional Data and its Applications (in conjunction with 2nd SIAM International Conference on Data Mining), Arlington, VA, USA, 2002, pp. 105-115.); by Daylight Chemical Information Systems Inc., in URL address (http://www.daylight.com/); and by Barnard Chemical Information Ltd., in URL address (http://www.bci.gb.com/). Nonetheless, all variants still exhibit the main characteristics of agglomerative algorithms, in that they allow the formation of large irregularly-shaped clusters with chains of association bridging poorly-associated elements.

H. Review of Methods for Dimension Reduction 

Latent semantic indexing (LSI) is a vector space model-based algorithm for reducing the dimension of the document ranking problem; see Deerwester et al. (Scott Deerwester, Susan T. Dumais, George W. Furnas, Richard Harshman, Thomas K. Landauer, Karen E. Lochbaum, Lynn A. Streeter, Computer information retrieval using latent semantic analysis, U.S. Patent No. 4839853, filed Sept. 15, 1988, issued June 13, 1989; Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science, 41, 6, 1990, pp. 391-407.). LSI reduces the retrieval and ranking problem to one of significantly lower dimension so that retrieval from very large databases can be performed more efficiently.

Another dimension-reduction strategy, due to Kobayashi et al. (Mei Kobayashi, Loic Malassis, Hikaru Samukawa, Retrieval and ranking of documents from a database, IBM Japan, docket No. JP9-2000-0G75, filed June 12, 2000; Loic Malassis, Mei Kobayashi, Statistical methods for search engines, IBM Tokyo Research Laboratory Research Report RT-413, 33 pages, May 2, 2001.), is a method called COV, which uses the covariance matrix of the document vectors to determine an appropriate reduced-dimensional space into which to project the document vectors. LSI and COV are comparable methods for information retrieval; for some databases and some queries, LSI leads to slightly better results than COV, while for others, COV leads to slightly better results.
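The general idea of a covariance-based projection can be sketched as follows. This is a generic illustration and not the COV method as specified in the cited references: it assumes that the reduced space is spanned by the dominant eigenvector of the covariance matrix, found here by power iteration, and that only one dimension is kept. All function names are hypothetical.

```python
import math

def covariance_matrix(vectors):
    """Return (covariance matrix, mean vector) of equal-length vectors."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)]
           for a in range(d)]
    return cov, mean

def leading_direction(matrix, iterations=200):
    """Approximate the dominant eigenvector by power iteration."""
    d = len(matrix)
    x = [1.0 / (i + 1) for i in range(d)]    # arbitrary nonzero start vector
    for _ in range(iterations):
        y = [sum(matrix[a][b] * x[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(t * t for t in y))
        if norm == 0.0:                      # all-zero covariance: nothing to find
            break
        x = [t / norm for t in y]
    return x

def project(vectors, direction, mean):
    """Coordinate of each centered vector along the given direction."""
    return [sum((v[j] - mean[j]) * direction[j] for j in range(len(mean)))
            for v in vectors]

# Hypothetical document vectors that vary only in their first coordinate.
docs = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0], [6.0, 0.0, 1.0]]
cov, mean = covariance_matrix(docs)
direction = leading_direction(cov)
coords = project(docs, direction, mean)
```

In this toy data all variance lies along the first keyword, so the three-dimensional vectors reduce to one coordinate each without losing the differences between documents.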

[Problem to be Solved by Invention] 
Conventional cluster detection based on distances has other inconveniences, as described below:

The usual clustering methods for machine learning contexts are designed 
to find major groupings within data sets. Here, a method is considered g 
ood if the clusters allow unknown points to be classified with high accu 
racy. However, in data mining contexts, the major clusters of the data a 
re often well understood by users, and it is the smaller, minor clusters 
that have the potential of revealing valuable nuggets of information. E 
xisting clustering techniques based on partition or agglomeration are la 
rgely ineffective at separating out small data clusters from their backg 
round. 

Another inconvenience is that massive text databases are typically partitioned into smaller collections in order to increase the efficiency of information retrieval operations. This distribution is usually performed so that the largest clusters in the data set remain intact within a single database. However, partition methods that focus on major clusters may cause valuable minor clusters to be dispersed among several databases. Identifying minor clusters as well as major clusters can lead to partitions that more effectively preserve minor clusters.

As described before, users of clustering tools are often interested in knowing the relationships among the clusters produced by the tool. Hierarchical clustering algorithms attempt to fill this need by producing a nested collection of clusters, with a single cluster containing the entire data set at the top, and the smallest clusters at the bottom. However, many of these clusters may exist only as a byproduct of the hierarchical organization, and have no useful interpretation of their own. Users would primarily expect each cluster reported by a data mining tool to have some independent conceptual interpretation. Once a set of meaningful clusters has been identified, users would likely be interested in knowing of any overlap or inclusion relationships among them.

In addition, in multi-dimensional settings it is very difficult to repre 
sent or describe the associative qualities of data clusters in a way tha 
t is easy for users to understand. When browsing clustered data, users n 
eed to be able to assess the degree of cohesion and prominence of cluste 
rs at a glance. 

With respect to hardware resources for the retrieval, clustering has generally been viewed as desirable yet impractical for data mining applications, due to the computation cost associated with achieving high-quality clusters when the data sets are very large. There is a tremendous demand for tools that can provide some insight into the organization of large data sets in reasonable time on an ordinary computer.

As described above, many methods have been proposed so far. Nevertheless, a novel data structure suitable for information retrieval with high efficiency and high speed, together with sufficient scalability, is still required in the art.

[Means to Solve Problem] 

The present invention hereby proposes a system and a method for information retrieval and data mining of large text databases, based on the identification of clusters of elements (e.g. documents) that exhibit a high degree of mutual similarity relative to their background.

In the present invention, profiles of clusters can be graphically displayed to the user, providing immediate visual feedback as to their quality and significance. Cluster attributes such as size and quality are assessed automatically by the system. The system also allows users to query the data set for clusters without the need for a precomputed global clustering. Scalability is achieved by means of dimensional reduction techniques, random sampling, and the use of data structures supporting approximate similarity search.

The present invention provides the above-described novel information retrieval features by improving detection efficiency of minor clusters while preserving such minor clusters. The novel information retrieval according to the present invention allows the interrelations of the clusters to be expressed as a graph structure to aid user understanding of the clusters. The present invention further makes it possible to improve the computational scalability of information retrieval.

The above aspects are provided by a system and methods for information retrieval and data mining of text databases, using shared neighbor information to determine query clusters. The clustering method assesses the level of mutual association between a query element (which may or may not be an element of the data set) and its neighborhood within the data set. The association between two elements is considered strong when the elements have a large proportion of their nearest neighbors in common. In contrast with previous methods making use of shared-neighbor information, the proposed methods are based on the new and original concepts of inter-cluster association confidence (CONF) and intra-cluster association self-confidence (SCONF).

According to the present invention, a computer system is provided for generating data structures for information retrieval of documents stored in a database, the documents being stored as document-keyword vectors generated from a predetermined keyword list, and the document-keyword vectors forming nodes of a hierarchical structure imposed upon the documents. The computer system comprises:
a neighborhood patch generation part for generating groups of nodes having similarities as determined using a search structure, the patch generation part including a part for generating a hierarchical structure upon the document-keyword vectors and a patch defining part for creating patch relationships among said nodes with respect to a metric distance between nodes; and

a cluster estimation part for generating cluster data of the document-keyword vectors using the similarities of patches.

According to the present invention, the computer system comprises a conf 
idence determination part for computing inter-patch confidence values be 
tween the patches and intra-patch confidence values, and the cluster est 
imation part selects the patches depending on the inter-patch confidence 
values to represent clusters of the document-keyword vectors. 

According to the present invention, the cluster estimation part estimate 
s sizes of the clusters depending on the intra-patch confidence values. 

According to the present invention, a method is provided for generating
data structures for information retrieval of documents stored in a
database, the documents being stored as document-keyword vectors
generated from a predetermined keyword list, and the document-keyword
vectors forming nodes of a hierarchical structure imposed upon the
documents. The method comprises the steps of:

generating a hierarchical structure upon the document-keyword vectors, a 
nd storing hierarchy data in an adequate storage area; 
generating neighborhood patches consisting of nodes having similarities 
as determined using levels of the hierarchical structure, and storing th 
e patches in an adequate storage area; 

invoking the hierarchy data and the patches to compute inter-patch confi 
dence values between the patches and intra-patch confidence values, and 
storing the values as corresponding lists in an adequate storage area; a 
nd 

selecting the patches depending on the inter-patch confidence values and 
said intra-patch confidence values to represent clusters of the documen 
t-keyword vectors. 

According to the present invention, a program may be provided for making
a computer system execute a method for generating data structures for
information retrieval of documents stored in a database, the documents
being stored as document-keyword vectors generated from a predetermined
keyword list, and the document-keyword vectors forming nodes of a
hierarchical structure imposed upon the documents. The program makes the
computer system execute the steps of:

generating a hierarchical structure upon the document-keyword vectors an 
d storing hierarchy data in an adequate storage area; 

generating neighborhood patches consisting of nodes having similarities 
as determined using levels of the hierarchical structure, and storing th 
e patches in an adequate storage area; 

invoking the hierarchy data and the patches to compute inter-patch
confidence values between the patches and intra-patch confidence values,
and storing the values as corresponding lists in an adequate storage
area; and

selecting the patches depending on the inter-patch confidence values and 
intra-patch confidence values to represent clusters of the document-key 
word vectors. 

According to the present invention, a computer readable medium may be pr 
ovided for storing a program for making a computer system execute a meth 
od for generating data structures for information retrieval of documents 
stored in a database, the documents being stored as document-keyword ve 
ctors generated from a predetermined keyword list, and the document-keyw 
ord vectors forming nodes of a hierarchical structure imposed upon the d 
ocuments. The program makes the computer system execute the steps of: 
generating a hierarchical structure upon the document-keyword vectors an 
d storing hierarchy data in an adequate storage area; 

generating neighborhood patches consisting of nodes having similarities 
as determined using levels of the hierarchical structure, and storing th 
e patch list in an adequate storage area; 

invoking the hierarchy data and the patches to compute inter-patch confi 
dence values between the patches and intra-patch confidence values, and 
storing the values as corresponding lists in an adequate storage area; a 
nd 

selecting the patches depending on the inter-patch confidence values and 
intra-patch confidence values to represent clusters of the document-key 
word vectors. 

According to the present invention, an information retrieval system may
be provided for documents stored in a database, the documents being
stored as document-keyword vectors generated from a predetermined
keyword list, and the document-keyword vectors forming nodes of a
hierarchical structure imposed upon the documents. The system comprises:
a neighborhood patch generation part for generating groups of nodes
having similarities as determined using a hierarchical structure, the
patch generation part including a part for generating a hierarchical
structure upon the document-keyword vectors and a patch defining part
for creating patch relationships among said nodes with respect to a
metric distance between nodes;

a cluster estimation part for generating cluster data of the
document-keyword vectors using the similarities of patches; and

a graphical user interface part for presenting the estimated cluster
data on a display means.

According to the present invention, the information retrieval system
comprises a confidence determination part for computing inter-patch
confidence values between the patches and intra-patch confidence values,
and the cluster estimation part selects the patches depending on the
inter-patch confidence values to represent clusters of the
document-keyword vectors. According to the present invention, the
cluster estimation part estimates sizes of the clusters depending on the
intra-patch confidence values. According to the present invention, the
system further comprises a user query receiving part for receiving the
query and extracting data for information retrieval to generate a query
vector, and an information retrieval part for computing similarities
between document-keyword vectors and the query vector to select the
document-keyword vectors. The clusters are estimated using the retrieved
document-keyword vectors with respect to the user input query.

According to the present invention, a graphical user interface system
for graphically presenting estimated clusters on a display device in
response to a user input query may be provided. The graphical user
interface system comprises:

a database for storing documents;

a computer for generating document-keyword vectors for the documents
stored in the database and for estimating clusters of documents in
response to the user input query; and
a display for displaying on screen the estimated clusters together with
confidence relations between the clusters and hierarchical information
pertaining to cluster size.

According to the present graphical user interface, the computer comprise 
s a neighborhood patch generation part for generating groups of nodes ha 
ving similarities as determined using a search structure, the neighborho 
od patch generation part including a part for generating a hierarchical 
structure upon the document-keyword vectors and a patch defining part fo 
r creating patch relationships among the nodes with respect to a metric 
distance between nodes; and 

a cluster estimation part for generating cluster data of the document-ke 
yword vectors using the similarities of patches. Further according to th 
e present invention, the computer comprises a confidence determination p 
art for computing inter-patch confidence values between the patches and 
intra-patch confidence values, and the cluster estimation part selects t 
he patches depending on the inter-patch confidence values to represent c 
lusters of the document-keyword vectors and the cluster estimation part 
estimates sizes of the clusters depending on the intra-patch confidence 
values. 

[Embodiment of the Invention] 
Part I. Essential Processes of the Method 

Hereinafter, the present invention will be explained in the context of
information retrieval of documents; however, the present invention is
not limited thereto, and the algorithm of the present invention can be
adapted for any application for which a pairwise dissimilarity measure
is used that satisfies the properties of a distance metric (with the
possible exception of the triangle inequality), and for which each data
element has keywords or other information that can be used for
annotation purposes. One example of such an application is a data mining
system for multimedia databases (e.g., databases with contents which
consist of text, audio, video, still images, graphics images, graphics
videos, and/or GIF animations, etc.) having contents for which such a
pairwise dissimilarity metric exists.

A flowchart of the general method according to the present invention is
shown in Fig. 1. Although the present invention is primarily explained
using an application to texts, a person skilled in the art will
understand that the methods of the present invention are easily adapted
to any database with contents which may be modeled with a clearly
defined metric that enables computation of distances between any two
elements, so that pairs of elements which are "closer" (with respect to
the metric) are more similar than pairs of elements that are "further
apart".

The method of the present invention begins from the step S10, where
documents in a database are transformed into vectors using the vector
space model. Next, the method generates in the step S12 a SASH
similarity search structure for the data stored in the database. Next,
for every element of the database, the SASH structure is used in the
step S14 to compute a neighborhood patch consisting of a list of those
database elements most similar to it. These patches are then stored in
an adequate memory area.

In the step S16, a list of self-confidence values, hereafter referred to
as SCONF values, is computed for every stored patch. These SCONF values
are used to compute relative self-confidence values, hereafter referred
to as RSCONF values, that are in turn used to determine the size of the
best subset of each patch (which is itself also a patch) to serve as a
cluster candidate. Next, the method proceeds to the step S18, at which
confidence values, hereafter referred to as CONF values, are used to
eliminate redundant cluster candidates. The method then proceeds to the
step S20 for further selection of those cluster candidates having at
least a desired minimum value of RSCONF as the final clusters, and for
storing these selected clusters in an adequate memory. The method
further proceeds to the step S22 to display to the user, by a graphical
user interface (GUI) on a computer screen, a graph indicating the
interrelationships among the clusters. The method of Fig. 1 further
comprises sub-steps for performing each step of Fig. 1, and the
sub-steps will hereinafter be described in detail.

<Computation of Document-Keyword Vectors>

Document-keyword vectors may be computed from given keywords and
documents using any of several known techniques. In a particular
embodiment of the present invention, appropriate weighting is used to
digitize the documents; details of the digitization have been provided
elsewhere (e.g., Salton et al., op. cit.), and are therefore not
explained in the present invention.

<SASH Construction and Usage> 

Fig. 2 shows a general procedure for constructing the hierarchical
structure of the document-keyword vectors known as a spatial
approximation sample hierarchy, or SASH. The process begins at the step
S28, after receiving the result of the step S10 of Fig. 1, to generate a
random assignment of vectors to nodes of the SASH using, for example,
any well-known random number generating program. The levels are numbered
from 0 to h, where each level contains roughly twice as many vector
nodes as the one following it. The level numbered 0 contains roughly
half the vector nodes of the data set, and the level numbered h contains
a single node, called the top node. The top node of the SASH structure
is determined randomly using any random number generation means included
elsewhere in the computer system. Next, in the step S30, a hierarchy
level reference L is initialized to h. The process proceeds to the step
S32 to decrease the hierarchy level L by 1, and in the step S34 the
level L nodes are connected to a set of level L+1 nodes depending on
distances between the nodes. In this connection, the nodes at level L+1
become parent nodes and the nodes at level L become child nodes. The
connection is performed by choosing parents of a node from level L from
among the closest nodes from level L+1, and then connecting these
parent-child node pairs so that each parent is connected to a
predetermined number of its closest children. Further details on how the
connections are performed are given elsewhere, by Houle et al. (op.
cit.). The process proceeds to the step S36 and determines whether or
not the hierarchy level has reached the lowest level (0). If so (yes),
the construction of the SASH is completed, the SASH structure is stored
in an adequate memory area such as memory or a hard disk, and the
process continues to the step S38 to construct patches of nodes. If not
(no), the process reverts to the step S32 and repeats until an
affirmative result in the step S36 is obtained.
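
As a minimal sketch of the random level assignment of the step S28 only
(the parent-child connection of the step S34 is not shown), the
following Python fragment distributes n vector indices over levels 0 to
h. The function name assign_sash_levels and the exact halving rule (each
level takes the ceiling of half of the nodes not yet assigned) are
assumptions made for illustration; the description above only requires
that each level hold roughly twice as many nodes as the next.

```python
import random


def assign_sash_levels(n, seed=0):
    """Randomly assign node indices 0..n-1 to SASH levels 0..h.

    Level 0 receives about half of the nodes, each following level about
    half of what remains, and the last level holds the single top node.
    """
    if n < 1:
        raise ValueError("need at least one node")
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)  # random assignment of vectors to nodes

    levels, start, remaining = [], 0, n
    while remaining > 1:
        size = (remaining + 1) // 2  # ceiling of half the remaining nodes
        levels.append(order[start:start + size])
        start += size
        remaining -= size
    levels.append(order[start:])  # exactly one node is left: the top node
    return levels


levels = assign_sash_levels(10)
print([len(level) for level in levels])  # [5, 3, 1, 1]
```

Here levels[0] corresponds to level 0 and levels[-1] to level h.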

In the step S38, the stored SASH structure is used according to the
present invention to generate a patch for every element of the database.
A patch for a given element q with respect to a subset R of the database
is a set of neighboring elements of q drawn from R, according to a
predetermined measure of similarity dist. In the described embodiment
for constructing the SASH, each node in the database is labeled with its
hierarchy level, and the patch for each node is of a predetermined,
fixed size, and is computed with respect to the set of all nodes at the
same level or greater. The present invention is not limited to
constructing and storing only one patch per node; additional patches
with respect to other node sets may also be constructed and stored.

Fig. 3 shows an illustrative example of construction of the SASH
structure together with the structure of the patch created according to
the present invention. As shown in Fig. 3, the vector nodes referred to
by a patch can essentially belong to any of the SASH hierarchy levels at
or above the level of the vector node upon which it is based. In
addition, from among the nodes at these hierarchy levels, patches
contain the nodes closest to the base node according to a predetermined
"metric distance". The base node may be selected from any or all nodes
included in the hierarchical structure so as to provide global
constructions of clusters; in an alternative embodiment of the present
invention, the base node may be determined using a user inputted query
so as to provide cluster information specifically about the queried base
node, i.e., a retrieved document. The base node is represented by the
star in Fig. 3, and the nodes in the patch are aligned with respect to
the user query as shown in Fig. 3. The patch structure is also stored in
an adequate memory area in the system described in detail hereinafter.
In the present invention, these patches are further related in terms of
confidence, as described below.

<Computation of Confidences>

The method of the present invention uses a novel model for clustering
that borrows from both information retrieval and association rule
discovery, herein named the "patch model". The patch model assumes that
data clusters can be represented as the results of neighborhood queries
based on elements from the data set, according to some measure of
(dis)similarity appropriate to the domain. More formally, let S be a
database of elements drawn from some domain D, and let "dist" be a
pairwise distance function defined on D satisfying the properties of a
metric, as defined earlier. Further, let R be a subset of S. For any
given query pattern $q \in D$, let NN(R, q, k) denote a k-nearest
neighbor set of q, drawn from R according to dist, and chosen subject to
the following conditions:

If $q \in R$, then NN(R, q, 1) = {q}; that is, if q is a member of the
data set, then q is considered to be its own nearest neighbor.
$\mathrm{NN}(R, q, k-1) \subset \mathrm{NN}(R, q, k)$ for all
$1 < k \le |R|$; that is, smaller neighborhoods of q are strictly
contained in larger neighborhoods.

These conditions take into account the possibility that q may have more
than one distinct k-nearest neighbor set in R. The uniquely-determined
set NN(R, q, k) is referred to as the k-patch of q (relative to R), or
simply as one of the patches of q.
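
The two conditions on NN(R, q, k) can be illustrated with a brute-force
sketch in Python. This is not the SASH-based search of the described
embodiment: the function name patch, the use of Euclidean distance as
"dist", and the tie-breaking by index are assumptions made for
illustration. Breaking ties by index makes the k-patch uniquely
determined and guarantees that each smaller patch is a prefix of every
larger one. The sketch also assumes R contains no duplicate points, so
that a query drawn from R is its own nearest neighbor.

```python
import math


def patch(R, q, k):
    """Return NN(R, q, k) as a list of indices into R, nearest first.

    Ties in distance are broken by index, so the result is uniquely
    determined and NN(R, q, k-1) is always a prefix of NN(R, q, k).
    """
    order = sorted(range(len(R)), key=lambda i: (math.dist(R[i], q), i))
    return order[:k]


R = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
q = R[0]
print(patch(R, q, 1))  # [0]
print(patch(R, q, 3))  # [0, 1, 2]
```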

Fig. 4 illustrates a collection of patches (a 7-patch, a 12-patch, and a 
n 18-patch) of a database. The dashed circle represents the entire docum 
ent space. 

Consider now the situation in which two potential clusters within R are
represented by the two patches $C_i = \mathrm{NN}(R, q_i, k_i)$ and
$C_j = \mathrm{NN}(R, q_j, k_j)$. The relevance of $C_j$ to $C_i$ is
assessed according to a natural confidence measure resembling that of
association rule discovery proposed by Agrawal and Srikant (op. cit.):

$$\mathrm{CONF}(C_i, C_j) = \frac{|C_i \cap C_j|}{|C_i|} = \frac{|\mathrm{NN}(R, q_i, k_i) \cap \mathrm{NN}(R, q_j, k_j)|}{k_i}.$$

That is, the confidence is expressed as the proportion of elements
forming $C_i$ that also contribute to the formation of $C_j$. If the
confidence value is small, the candidate $C_j$ has little or no impact
upon $C_i$; on the other hand, if the proportion is large, $C_j$ is
strongly related to $C_i$, possibly even subsuming it.

Fig. 5 illustrates the function CONF applied to the clusters A and B,
which include 8 and 10 vectors, respectively. Two vectors lie in the
intersection of A and B. Therefore, when the function CONF is applied to
the patches in the order A, B, that is, CONF(A, B), the result is 2/8 =
0.25, or 25%. When the function is applied in the order B, A, that is,
CONF(B, A), the result is 2/10 = 0.2, or 20%. The function CONF can be
applied to any two patches drawn from a common underlying sample of the
database.
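
The numbers quoted for Fig. 5 can be reproduced with a short Python
sketch. The function name conf and the concrete element labels are
assumptions made for illustration; only the set sizes (8 and 10) and the
size of the intersection (2) come from the description above.

```python
def conf(ci, cj):
    """CONF(C_i, C_j) = |C_i intersect C_j| / |C_i| for two patches."""
    ci, cj = set(ci), set(cj)
    return len(ci & cj) / len(ci)


A = set(range(0, 8))    # 8 elements: 0..7
B = set(range(6, 16))   # 10 elements: 6..15, sharing {6, 7} with A

print(conf(A, B))  # 0.25
print(conf(B, A))  # 0.2
```

The measure is asymmetric: the denominator is always the size of the
first argument.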

The confidence measure can also be regarded as an example of a
shared-neighbor distance metric. However, the uses to which the
shared-neighbor information is put in this invention are very different
from those of agglomerative clustering methods: whereas agglomerative
methods use such metrics to decide whether two patches should be merged,
the proposed method uses them to assess the quality of the level of
association between two query patches.

<Computation of Intra-Cluster Association>

A natural assessment of association within patches is also possible in
terms of the notion of confidence. Let $C_q = \mathrm{NN}(R, q, k)$ be a
patch cluster candidate. Here the constituent patches of $C_q$ are
defined to be the set of those patches of the form
$C_v = \mathrm{NN}(R, v, k)$, for all elements $v \in C_q$. If $C_q$ has
a high degree of internal association, then one can reasonably expect
strong relationships between $C_q$ and its constituent patches. On the
other hand, low internal association would manifest itself as weak
relationships between $C_q$ and its constituent patches. Therefore,
internal association within a patch cluster candidate is measured in
terms of its self-confidence, defined as the average confidence of the
candidate patch with respect to its constituent patches:

$$\mathrm{SCONF}(C_q) = \frac{1}{|C_q|} \sum_{v \in C_q,\ |C_v| = |C_q|} \mathrm{CONF}(C_q, C_v) = \frac{1}{k^2} \sum_{v \in C_q} |\mathrm{NN}(R, q, k) \cap \mathrm{NN}(R, v, k)|.$$

A self-confidence value of 1 indicates perfect association among all
elements of a cluster, whereas a value approaching 0 indicates little or
no internal association.
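
The self-confidence formula can be sketched directly in Python by
brute force. The function names patch and sconf, the Euclidean metric,
and the toy data set are assumptions made for illustration; the
described embodiment obtains the neighbor lists from the SASH rather
than by exhaustive sorting.

```python
import math


def patch(R, q, k):
    """NN(R, q, k) as a set of indices into R; ties broken by index."""
    order = sorted(range(len(R)), key=lambda i: (math.dist(R[i], q), i))
    return set(order[:k])


def sconf(R, q, k):
    """SCONF(NN(R, q, k)): (1/k^2) * sum over v of |C_q intersect C_v|."""
    cq = patch(R, q, k)
    shared = sum(len(cq & patch(R, R[v], k)) for v in cq)
    return shared / (k * k)


# Two well-separated groups of three points each.
R = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

print(sconf(R, R[0], 3))  # 1.0
print(sconf(R, R[0], 4))  # 0.875
```

With k = 3 the patch coincides with one group and every constituent
patch is identical to it, giving a self-confidence of 1. With k = 4 the
patch absorbs one element of the other group, whose own neighborhood
lies mostly elsewhere, and the value drops.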



<Cluster Boundary Determination Using Intra-cluster Confidence> 
Assume for the moment that the subject node q is associated with some
cluster within R that we want to estimate. Using the notion of
self-confidence, the process determines the k-patch based at q that best
describes this cluster, over some range of interest $a \le k \le b$. The
ideal patch would be expected to consist primarily of cluster elements,
and to have a relatively high self-confidence, whereas larger patches
would be expected to contain many elements from outside the cluster and
to have a relatively low self-confidence. The evaluation focuses on two
patches: an inner patch $C_{q,k} = \mathrm{NN}(R, q, k)$ of size k
indicating a candidate patch cluster, and an outer patch
$C_{q,\varphi(k)} = \mathrm{NN}(R, q, \varphi(k))$ of size
$\varphi(k) > k$ that provides the local background against which the
suitability of the inner patch will be judged.

For a given choice of k, the neighbor sets of each element of the outer
patch are examined. Consider the neighbor pair (v, w) with v in the
outer patch, and w a member of the outer constituent patch
$\mathrm{NN}(R, v, \varphi(k))$. If v also lies in the inner patch, and
w is a member of the inner constituent patch NN(R, v, k), then (v, w) is
referred to herein as an inner neighbor pair.

If w is a member of the outer patch, then the pair (v,w) contributes to 
the self-confidence of the outer patch, thereby undermining the choice o 
f the inner patch as the descriptor for the cluster based at q. If w is 
also a member of the inner patch, and (v,w) is an inner pair, then the p 
air contributes to the self-confidence of the inner patch, thereby stren 
gthening the association between v and q. 

Essentially, the k-patch best describing the cluster containing q would
achieve the following:

i) a high proportion of inner pairs that contribute to the
self-confidence of the inner patch, and

ii) a high proportion of neighbor pairs (not necessarily inner) that do
not contribute to the self-confidence of the outer patch.

A high proportion of the former kind indicates a high level of
association within the k-patch, whereas a high proportion of the latter
kind indicates a high level of differentiation with respect to the local
background. As both considerations are equally important, these
proportions should be accounted for separately. The above considerations
have been taken into account by maximizing, over all choices of k in the
range $a \le k \le b$, the sum of these two proportions: that is,
$\mathrm{SCONF}(C_{q,k})$ and $1 - \mathrm{SCONF}(C_{q,\varphi(k)})$.
Since the constant 1 does not affect the maximizing choice of k, it is
dropped from the objective.

The relative self-confidence maximization (RSCM) problem can thus be
formulated as follows:

$$\max_{a \le k \le b} \mathrm{RSCONF}(C_{q,k}, \varphi),$$

where

$$\mathrm{RSCONF}(C_{q,k}, \varphi) = \mathrm{SCONF}(C_{q,k}) - \mathrm{SCONF}(C_{q,\varphi(k)}) = \mathrm{SCONF}(\mathrm{NN}(R, q, k)) - \mathrm{SCONF}(\mathrm{NN}(R, q, \varphi(k))),$$

wherein RSCONF is referred to as the relative self-confidence of the
k-patch $C_{q,k}$ with respect to R and $\varphi$. The k-patch at which
the maximum is attained shall be referred to as the query cluster of q
over this range. RSCM can be viewed as a form of maximum likelihood
estimation (MLE), in which neighbor pairs are classified as either
supporting or not supporting the choice of the inner patch as the query
cluster.
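
The RSCM problem can be sketched as a direct search over k. The
following Python fragment is a brute-force illustration under stated
assumptions: the names patch, sconf and rscm, the one-dimensional
stand-in metric, the default outer size of 2k, and the toy data set are
all hypothetical, and no incremental computation is attempted.

```python
def dist(x, y):
    """Stand-in metric: absolute difference of one-dimensional points."""
    return abs(x - y)


def patch(R, q, k):
    """NN(R, q, k) as a set of indices into R; ties broken by index."""
    order = sorted(range(len(R)), key=lambda i: (dist(R[i], q), i))
    return set(order[:k])


def sconf(R, q, k):
    """Self-confidence of the k-patch based at q."""
    cq = patch(R, q, k)
    return sum(len(cq & patch(R, R[v], k)) for v in cq) / (k * k)


def rscm(R, q, a, b, phi=lambda k: 2 * k):
    """Return (k, RSCONF) maximizing SCONF(k) - SCONF(phi(k)) on a..b."""
    best_k, best_value = None, float("-inf")
    for k in range(a, b + 1):
        value = sconf(R, q, k) - sconf(R, q, phi(k))
        if value > best_value:
            best_k, best_value = k, value
    return best_k, best_value


# Three groups of three points on a line; the query is the first point.
R = [0.0, 1.0, 2.5, 20.0, 21.0, 22.5, 35.0, 36.0, 37.0]
print(rscm(R, R[0], a=2, b=4))  # (3, 0.25)
```

For this data the search selects k = 3, the size of the group containing
the query. The caller must ensure that phi(b) does not exceed the number
of elements in R.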

Fig. 6 shows a sample pseudo-code listing for computing SCONF included
in the method of the present invention as part of a patch profile of a
query element, assuming that the neighbor lists
$\mathrm{NN}(R, q, \varphi(b))$ and $\mathrm{NN}(R, v, \varphi(b))$ are
already available for all $v \in \mathrm{NN}(R, q, \varphi(b))$. Instead
of producing SCONF(NN(R, q, k)) via direct computation, it is obtained
from SCONF(NN(R, q, k-1)) by computing the differential resulting from
the expansion of the patch by one item.

In the present invention, the RSCM method as presented allows for many
variations in the way the outer patch size depends on the value of k
(where k is an integer). Although the simple choice $\varphi(k) = 2k$ is
ideal in that it provides the best balance between membership and
non-membership of outer patch elements with respect to the inner patch,
other considerations may influence the choice of $\varphi(k)$. For
example, the cost of computing boundary sharpness values may encourage
the use of a maximum patch size m < 2b. In this case, the outer patch
size could be chosen to be $\varphi(k) = \min\{2k, m\}$, provided that
the smallest ratio m/b between outer and inner patch sizes is still
substantially greater than 1.

In the present invention, the design of the RSCM method assumes that
internal cluster association is as important as external
differentiation. However, in the present invention, different weightings
can be given to the internal and external contributions to the relative
self-confidence value; that is, one can instead maximize functions of
the form

$$\mathrm{RSCONF}'(C_{q,k}, \varphi) = w_1\,\mathrm{SCONF}(C_{q,k}) - w_2\,\mathrm{SCONF}(C_{q,\varphi(k)}),$$

for real-valued choices of weights $0 < w_1$ and $0 < w_2$.

At the present stage, each stored patch $C_{q,m} = \mathrm{NN}(R, q, m)$
is associated with a list of self-confidence values
$\mathrm{SCONF}(C_{q,k})$ for each sub-patch
$C_{q,k} = \mathrm{NN}(R, q, k)$ of $C_{q,m}$, for all values of k in
the range $1 \le k \le m$. This data structure, hereinafter referred to
as the SCONF list and shown in Fig. 7, may be recorded in an adequate
storage means such as a hard disk or a memory to be referred to by the
cluster selection function of the present invention.

A further variation of the present invention reduces the cost of
computation. The cost of computing RSCONF values grows quadratically as
the maximum outer patch size increases. This cost restricts the size of
clusters that can be discovered in practice using the RSCM method
directly on the full data set. However, these restrictions can be
circumvented through the use of random sampling techniques. Instead of
accommodating large clusters by adjusting the limits of the range
$a \le k \le b$ over which the RSCM problem is solved, one can instead
search for patches of sizes in a fixed range, taken relative to a
collection of data samples of varying size.

To understand the above variation, consider the relationship between a
uniform random sample $R \subseteq S$ and a hypothetical query cluster
NN(S, q, c), for some large value of c. The intersection of NN(S, q, c)
and R produces a patch NN(R, q, k), where
$k = |\mathrm{NN}(S, q, c) \cap R|$. The patch NN(R, q, k) serves as a
proxy for NN(S, q, c) with respect to the sample R: the choice of
NN(R, q, k) as a query cluster for q in R can be taken as an indication
of the appropriateness of NN(S, q, c) as a query cluster for q with
respect to the entire data set.

If $a \le k \le b$, then the proxy patch will be evaluated by the RSCM
method. Otherwise, if k does not lie between a and b, the patch will not
be evaluated. In terms of the unknown "true" cluster size c, bounds on
the probability of the proxy patch not being evaluated can be derived
using standard Chernoff bound techniques, as described (for example) in
Motwani and Raghavan (R. Motwani and P. Raghavan, Randomized Algorithms,
Cambridge University Press, New York, USA, 1995):

$$E[k] = \mu = c\,|R| / |S|$$

$$\Pr[k < a \mid c] \le e^{-\mu} \left[\frac{e\mu}{a-1}\right]^{a-1}$$

$$\Pr[k > b \mid c] \le e^{-\mu} \left[\frac{e\mu}{b+1}\right]^{b+1}$$

One can use these bounds as a guide in choosing appropriate values of a
and b, as well as a collection of samples of appropriate sizes, so that
for a desired probability for sufficiently-large c, at least one proxy
patch has size between a and b for at least one of the samples.

As an illustrative example, consider a collection of uniform random
samples $\{R_0, R_1, R_2, \ldots\}$ such that $|R_i| = |S| / 2^i$ for
$i \ge 0$. Now, let $\mathrm{NN}(R_i, q, k_i)$ be the proxy patch of
NN(S, q, c), where c is an unknown value guaranteed to be at least 25.
If the limits a = 25 and b = 120 are chosen, then for at least one
sample $R_i$, the expected size $\mu_i = E[k_i]$ of its proxy patch must
lie in the range $44 \le \mu_i \le 88$. Applying the bounds stated
above, when $\mu_i$ is restricted to this range, the probability of
$\mathrm{NN}(R_i, q, k_i)$ failing to be evaluated by the RSCM method is
estimated to be low (less than 0.004285).

In other words, for this choice of range and samples, the probability th 
at none of the proxy patches are evaluated is less than 1 in 233. This e 
rror bound is quite conservative - in practice, the probability of failu 
re would be far smaller. 
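
The figure quoted in this example can be checked numerically. The
Python sketch below assumes the standard Chernoff forms
e^(-mu) * (e*mu/(a-1))^(a-1) for the lower tail and
e^(-mu) * (e*mu/(b+1))^(b+1) for the upper tail; the function names are
hypothetical, and the scan over integer values of mu between 44 and 88
is a simplification made for illustration.

```python
import math


def lower_tail_bound(mu, a):
    """Bound on Pr[k < a]: exp(-mu) * (e * mu / (a - 1)) ** (a - 1)."""
    x = a - 1
    # Evaluated in log space to avoid overflow in the power term.
    return math.exp(-mu + x * (1.0 + math.log(mu / x)))


def upper_tail_bound(mu, b):
    """Bound on Pr[k > b]: exp(-mu) * (e * mu / (b + 1)) ** (b + 1)."""
    x = b + 1
    return math.exp(-mu + x * (1.0 + math.log(mu / x)))


a, b = 25, 120
worst = max(
    lower_tail_bound(mu, a) + upper_tail_bound(mu, b)
    for mu in range(44, 89)
)
print(worst)            # close to 0.004285, attained at mu = 44
print(worst < 1 / 233)  # True
```

The worst case occurs at the lower end of the range, where the
lower-tail bound dominates; at mu = 88 the upper-tail bound is slightly
smaller.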

Even when the RSCM method promotes a proxy patch
$\mathrm{NN}(R_i, q, k_i)$ as a cluster estimator, there is no precise
way of inferring the size of the corresponding cluster in S. However,
following the principle of maximum likelihood estimation, the value
$c = E[k]\,|S| / |R_i|$ at which $E[k] = k_i$ constitutes a natural
estimate of the true cluster size. The smallest cluster size that can be
estimated with respect to sample $R_i$ is therefore $(a\,|S|) / |R_i|$.
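
As a small worked sketch of this estimate, the Python fragment below
scales an observed proxy patch size back to the full data set. The
function names and the sizes used (a data set of 100,000 elements and a
sample of 12,500 elements) are hypothetical values chosen for
illustration.

```python
def estimate_cluster_size(k_i, sample_size, data_size):
    """Solve E[k] = c * |R_i| / |S| for c, taking E[k] = k_i."""
    return k_i * data_size / sample_size


def smallest_estimable_size(a, sample_size, data_size):
    """Smallest estimate obtainable from a sample: a * |S| / |R_i|."""
    return a * data_size / sample_size


S_SIZE = 100_000   # |S|, assumed
R3_SIZE = 12_500   # |R_3| = |S| / 2^3, assumed

print(estimate_cluster_size(40, R3_SIZE, S_SIZE))    # 320.0
print(smallest_estimable_size(25, R3_SIZE, S_SIZE))  # 200.0
```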

It should be noted that when the same cluster is detected several times
over several different samples, the estimates of the true cluster size
may not agree. Nevertheless, in practice, a large RSCONF value will
generally turn out to be a reliable indicator of the presence of a
cluster, even if the size of the cluster cannot be precisely determined.

<Element Reclassification>

Further in the present invention, by virtue of the proximity of their
members to a common query element, clusters produced by the RSCM method
tend to be much more cohesive than those produced by agglomerative
clustering methods, a desirable trait in the context of text mining. In
particular, query clusters are biased towards shapes that are spherical
relative to the pairwise distance metric.

Although the solution cluster patch for the RSCM problem as a whole exhi 
bits a high level of mutual association relative to others based at the 
same query element, the members of such a cluster may or may not be stro 
ngly associated with the query element itself. Rather, the query element 
merely serves as a starting point from which a mutually well-associated 
neighborhood of the data can be discovered. When the query element is a 
n outlier relative to its associated cluster, or in other situations in 
which a substantial portion of the reported cluster seems composed of ou 
tliers, it may be advantageous to reassess the outer patch elements acco 
rding to a secondary clustering criterion. Such reassessment allows the 
discovery of cohesive clusters with less spherical bias. 

Many methods may be possible for reclassifying the elements in the
vicinity of a query cluster. A pseudo-code description of one such
variation appears in Fig. 8. The process described in Fig. 8 is given
below:

i) Given the inner k-patch that determined the original query cluster,
all members of the corresponding outer patch are reassessed according to
the actual number of k-nearest neighbors shared with the query element.
In particular, every $v \in \mathrm{NN}(R, q, \varphi(k))$ is ranked
according to the confidence value $\mathrm{CONF}(C_q, C_v)$, where
$C_q = \mathrm{NN}(R, q, k)$ and $C_v = \mathrm{NN}(R, v, k)$, from
highest to lowest (ties are broken according to distance from q).

ii) The k elements having the highest scores can be reported as the new,
adjusted cluster; alternatively, the entire ranking of the outer patch
elements can be reported, and the user left to judge the final cluster
membership. In this way, elements outside the original inner patch yet
inside the outer patch are eligible for inclusion in the new cluster,
provided they have a high number of original patch members among their
nearest neighbors.
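
The two reclassification steps can be sketched in Python as follows.
This is an illustration only and is not the listing of Fig. 8: the
names patch and reclassify, the one-dimensional stand-in metric, the
default outer size of 2k, and the toy data set are assumptions. Ranking
by the shared-neighbor count is equivalent to ranking by
CONF(C_q, C_v), since both patches have size k.

```python
def dist(x, y):
    """Stand-in metric: absolute difference of one-dimensional points."""
    return abs(x - y)


def patch(R, q, k):
    """NN(R, q, k) as a list of indices, nearest first; ties by index."""
    return sorted(range(len(R)), key=lambda i: (dist(R[i], q), i))[:k]


def reclassify(R, q, k, phi=lambda k: 2 * k):
    """Rank outer-patch members by neighbors shared with the inner patch.

    Returns (new_cluster, ranking); ties are broken by distance from q.
    """
    inner = set(patch(R, q, k))
    outer = patch(R, q, phi(k))  # already ordered by distance from q
    scored = []
    for rank, v in enumerate(outer):
        shared = len(inner & set(patch(R, R[v], k)))
        scored.append((-shared, rank, v))
    ranking = [v for _, _, v in sorted(scored)]
    return ranking[:k], ranking


# Index 2 (3.0) is in the inner 3-patch of q = 0.0 but its own neighbors
# lie elsewhere; index 3 (-3.2) is outside it but shares two neighbors.
R = [0.0, 1.0, 3.0, -3.2, 3.5, 4.0]
new_cluster, ranking = reclassify(R, R[0], k=3)
print(new_cluster)  # [0, 1, 3]
print(ranking)      # [0, 1, 3, 2, 4, 5]
```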

<Selection of Clusters> 

The proposed total clustering strategy, the function PatchCluster,
constructs a query cluster relationship (QCR) graph drawn from a
collection of uniform random samples {R_0, R_1, R_2, ...} such that
R_i ⊂ R_j for all j < i and |R_i| = ceil(|S| / 2^i) for
0 ≤ i < log_2 |S|. The graph structure depends on several parameters
resembling the confidence and support thresholds used in association
rule generation:

i) (cluster quality) a minimum threshold α on the relative
self-confidence of clusters;

ii) (cluster differentiation) a maximum threshold β on the confidence
between any two clusters of roughly the same size (drawn from a common
sample R_i);

iii) (association quality) a minimum threshold γ on the confidence
between associated clusters (not necessarily drawn from a common
sample);

iv) (association scale) a maximum threshold δ on the difference in
scale between two associated clusters (that is, the difference |i-j|,
where R_i and R_j are the samples from which the clusters derive).
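
The nested uniform random samples from which the QCR graph is drawn may
be generated, for example, as in the following Python sketch. Taking
prefixes of a single random shuffle is only one possible way to obtain
nested samples of the stated sizes; the function name nested_samples
and the seed parameter are assumptions made for illustration.

```python
import random


def nested_samples(items, seed=0):
    """Return [R_0, R_1, ...] with R_i contained in R_(i-1) and
    |R_i| = ceil(|S| / 2^i) for 0 <= i < log2 |S|."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    samples = []
    i = 0
    while 2 ** i < n:               # equivalent to i < log2(n)
        size = -(-n // 2 ** i)      # ceil(n / 2^i) in integer arithmetic
        samples.append(shuffled[:size])
        i += 1
    return samples


samples = nested_samples(range(10), seed=1)
print([len(r) for r in samples])  # [10, 5, 3, 2]
```

Because every sample is a prefix of the same shuffled list, each R_i is
automatically a subset of all samples with smaller index.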

Fig. 9 shows a sample pseudo-code description of the PatchCluster
method used in the present invention. The basic QCR construction
strategy can be summarized as follows:

1. QCR node set:

For each 0 ≤ t < log_2 |S|, from the elements of sample R_t, generate a
collection of query clusters QC_t = {C_1, C_2, ..., C_|R_t|}, with each
cluster C_i = NN(R_t, q_i, k_i) based at a different query element of
R_t, and a ≤ |C_i| ≤ b. Choose the membership of QC_t in greedy fashion
from among the available query clusters according to RSCONF values,
where i < j implies RSCONF(C_i) ≥ RSCONF(C_j), subject to two
conditions:

i. (cluster differentiation) max {CONF(C_i, C_j), CONF(C_j, C_i)} < β
for all 1 ≤ i < j ≤ m_t;

ii. (cluster quality) RSCONF(C_i) ≥ α for all 1 ≤ i ≤ |R_t|.

These clusters become the nodes of the QCR graph, at level t.

2. QCR edge set:

For each pair of distinct query clusters C_i = NN(R_i, q_i, k_i) in
QC_i and C_j = NN(R_j, q_j, k_j) in QC_j such that i ≤ j ≤ i+δ, insert
directed edges (C_i, C_j) and (C_j, C_i) into the QCR graph if
max {CONF(C_i, C'_j), CONF(C'_j, C_i)} ≥ γ, where
C'_j = NN(R_i, q_j, 2^(j-i) k_j). Apply the values CONF(C_i, C'_j) and
CONF(C'_j, C_i) as weights of edges (C_i, C_j) and (C_j, C_i),
respectively.

Each level of the graph can be viewed as a rough slice of the set of
clusters, consisting of those with estimated sizes falling within a
band depending upon the level, and upon a and b. Within each slice,
candidates are chosen greedily according to their RSCONF values, with
new candidates accepted only if they are sufficiently distinct from
previously-accepted candidates.
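
The greedy selection within one slice may be sketched in Python as
follows. This is a minimal illustration and not the listing of Fig. 9:
it assumes CONF(A, B) = |A ∩ B| / |A|, takes the RSCONF value of every
candidate as already computed, and uses hypothetical names
(select_level_clusters, max_clusters) and hypothetical sample data.

```python
def conf(a, b):
    """CONF(A, B) = |A ∩ B| / |A| (assumed definition)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0


def select_level_clusters(candidates, alpha, beta, max_clusters=None):
    """Greedily pick query clusters for one level of the QCR graph.

    candidates: list of (rsconf, members) pairs for one sample.
    A candidate is accepted if its RSCONF is at least alpha (cluster
    quality) and its confidence with every accepted cluster, in both
    directions, is below beta (cluster differentiation).
    """
    accepted = []
    for rsconf, members in sorted(candidates, key=lambda c: -c[0]):
        if rsconf < alpha:
            break  # all remaining candidates have lower RSCONF
        if max_clusters is not None and len(accepted) >= max_clusters:
            break  # optional cap on the number of clusters per level
        if all(max(conf(members, m), conf(m, members)) < beta
               for _, m in accepted):
            accepted.append((rsconf, members))
    return accepted


candidates = [(0.5, {1, 2, 3, 4}), (0.4, {1, 2, 3, 5}),
              (0.3, {7, 8, 9}), (0.1, {10, 11})]
picked = select_level_clusters(candidates, alpha=0.15, beta=0.4)
print([sorted(m) for _, m in picked])
```

Here the second candidate is rejected as a near-duplicate of the first,
and the last candidate is rejected because its RSCONF value lies below
the quality threshold.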

In the present invention, although duplicate clusters occurring at a
common level are eliminated, duplicate clusters are tolerated when they
occur at different levels. The QCR graph can thus contain any given
cluster only a small number of times. The presence of the same cluster
at several consecutive levels actually improves the connectivity of the
structure, as two query clusters sharing a common concept are likely to
be deemed to overlap, and thereby be connected by an edge. Fig. 10
shows a sample pseudo-code listing for eliminating clusters, referred
to as the "PatchCluster method" in the present invention.

By lowering or raising the value of α, users can increase or decrease
the number of cluster nodes appearing in the graph. Raising the value
of β also increases the number of clusters; however, this comes at the
risk of individual concepts being shared by more than one cluster from
a given sample. Users can vary the value of γ to influence the number
of graph edges. For the purpose of navigating the clustering results,
high graph connectivity is desirable. The maximum threshold δ on the
difference in scale between two associated clusters of the QCR graph
should be a small, fixed value, for reasons that will be discussed
later.

Another variation of the PatchCluster method involves the control of
the number of clusters. As described above, the number of clusters
produced is controlled by specifying a threshold α on the relative
self-confidence of the query clusters reported. Instead, the user may
be given the option of determining the number of clusters for each data
sample separately. For a given level t, this can be done by:

i) specifying a minimum threshold α_t on the relative self-confidence
of the query clusters to be reported from level t, or

ii) specifying a maximum threshold on the absolute number of query
clusters to be reported from level t.

When a threshold on the number of clusters is given, the greedy
selection of clusters terminates when the desired number of clusters
has been obtained, or when all candidates have been considered
(whichever occurs first).

In the PatchCluster method, the PatchCluster / RSCM parameters may be
determined depending on the system in which the above-described method
or algorithm is implemented. The parameters determined are as follows:

<Inner Patch Size Range> 

The inner patch size range [a, b] should be chosen so as to allow
arbitrarily large clusters to be discovered by the method. Although
more precise choices of a and b are possible by analyzing the
probability of failure (using Chernoff bounds as described earlier),
the following general principles apply. Parameter a should be large
enough to overcome the variation due to small sample sizes; it is
recommended that a be no smaller than 20. Parameter b should be chosen
such that the ranges of cluster sizes targeted at consecutive levels
have substantial overlap. This is achieved when b is roughly 3 times as
large as a, or greater.
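
The effect of the ratio between a and b on the overlap of consecutive
levels can be checked with the following short Python sketch. It rests
on the assumption, not stated in this passage, that a patch of size k
found in sample R_t corresponds to an estimated cluster size of about
k * 2^t in the full set; the function names are hypothetical.

```python
def size_band(t, a, b):
    """Estimated cluster sizes targeted at level t (assumed scaling by 2^t)."""
    return a * 2 ** t, b * 2 ** t


def overlap_fraction(a, b):
    """Share of the size band of one level that is also covered by the next level.

    Level t covers [a*2^t, b*2^t] and level t+1 covers [2a*2^t, 2b*2^t],
    so the common part has relative width (b - 2a) / (b - a).
    """
    return max(0.0, (b - 2 * a) / (b - a))


for t in range(3):
    print(t, size_band(t, 25, 120))
print(overlap_fraction(20, 60))    # b = 3a: half of each band is shared
print(overlap_fraction(25, 120))   # the values a = 25, b = 120
```

Under this assumption, b = 3a makes half of each band overlap with the
band of the next level, while b ≤ 2a leaves no overlap at all.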

<Maximum Patch Size> 

Also, the maximum patch size should be chosen to be as small as
possible for reasons of efficiency. However, it should be chosen to be
substantially larger than b. The choice φ(b) = 2b is ideal; however,
the choice φ(b) = 1.25b can also give good results. In the best
embodiment of the present invention, a = 25, b = 120, and
φ(k) = min {2k, 150} were preferred because satisfactory results were
obtained with many data sets.

The maximum threshold β on the confidence between any two clusters from
a common sample should be set to roughly 0.4, regardless of the data
set. Experimentation showed that overlapping query clusters from a
common sample tend either to overlap nearly completely, or only
slightly. The clustering produced by the PatchCluster method is
relatively insensitive to the exact choice of β.

<The Threshold δ>

The maximum threshold δ on the difference in scale between two
associated clusters of the QCR graph should always be a small, fixed
value, for several reasons. Large values will lead to graphs in which
the largest clusters would be connected to an overwhelming number of
very small clusters. As a result, the QCR graph would become very
difficult for users to navigate. For every query cluster from level 0,
a neighborhood of the form NN(R_i, q_j, 2^δ k_j) would need to be
computed. To ensure scalability, δ must be chosen to be a small
constant. The value used in the experimentation, δ = 4, allowed
association edges to be generated between clusters whose sizes differ
by at most a factor of roughly 2^4 to 2^5. This choice of δ is strongly
recommended.

The following parameters should be set by users according to their parti 
cular demands: 

(a) The minimum threshold α on the relative self-confidence of clusters
(or alternatively, for each sample level, the minimum cluster relative
self-confidence and/or the maximum number of desired query clusters).
Values in the range 0.1 ≤ α ≤ 0.2 are recommended; the smaller the
value, the greater the number of clusters.

(b) The minimum threshold γ on the confidence between associated
clusters in the QCR graph (not necessarily drawn from a common sample
level). Values in the range 0.15 ≤ γ ≤ 0.2 are recommended; the smaller
the value, the greater the number of edges of the graph.

(c) The number of keyword labels to be applied to each query cluster. 

A further variation of the PatchCluster method is to compute
approximate neighborhood patches instead of exact ones. The
neighborhood computation performed by the PatchCluster method can be
expensive if the number of data elements is large and exact
neighborhood information is sought. To improve the efficiency of the
method, approximate neighborhood information can be substituted.
Similarity search structures such as a SASH can be used to generate
this information much faster than sequential search at high levels of
accuracy.

A further variation of the PatchCluster method is to perform
dimensional reduction of the document-keyword vectors and keyword
vectors.

The basic PatchCluster method, as described in Fig. 9, when applied to
text data, assumes that documents have been modeled as vectors using an
appropriate weighting. When the keyword space is large, but the average
number of keywords per document is small, distance computations between
vectors can be performed efficiently if the vectors are represented
implicitly (that is, if only non-zero entries and their positions are
stored). However, when the average number of keywords per document is
large, dimensional reduction is often performed in order to limit the
cost of distance comparisons. Regardless of the original average number
of keywords per document, dimensional reduction techniques such as LSI
or COV can be applied to the data before clustering, if desired. The
experimental results presented in the Embodiments section show the
respective advantages of the use or non-use of dimensional reduction.
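
The implicit representation mentioned above, in which only the non-zero
entries and their positions are stored, may be illustrated by the
following Python sketch. The use of a dictionary mapping keyword index
to weight is an assumption made for illustration, as are the function
names and the sample vectors.

```python
import math


def sparse_dot(u, v):
    """Dot product of two sparse vectors stored as {position: weight}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v[i] for i, w in u.items() if i in v)


def cosangle(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    norm_u = math.sqrt(sparse_dot(u, u))
    norm_v = math.sqrt(sparse_dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sparse_dot(u, v) / (norm_u * norm_v)


# Two hypothetical documents in a large keyword space, each with two keywords.
doc_1 = {0: 1.0, 5: 2.0}
doc_2 = {5: 2.0, 9: 1.0}
print(cosangle(doc_1, doc_2))
```

The cost of each comparison then depends on the number of keywords
actually present in the documents rather than on the size of the
keyword space.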

Yet another variation of the PatchCluster method is possible by
incorporating QCR graph simplification. The QCR graph produced by the
PatchCluster method contains association information for many pairs of
clusters. However, this information may sometimes be too dense for
users to easily navigate without simplification. Some of the ways in
which the graph could reasonably be simplified are:

i) (Elimination of transitive edges between levels.) For example,
assume the graph contains cluster nodes C_1 = NN(R_u, q_1, k_1),
C_2 = NN(R_v, q_2, k_2), and C_3 = NN(R_w, q_3, k_3), where u < v < w,
and association edges (C_1, C_2), (C_2, C_3), and (C_1, C_3). Then edge
(C_1, C_3) can be hidden from the user, since he or she would still be
able to navigate from C_1 to C_3 via (C_1, C_2) and (C_2, C_3).

ii) (Contraction of similar clusters.) If two clusters
C_1 = NN(R_u, q_1, k_1) and C_2 = NN(R_v, q_2, k_2) are deemed to be
very similar due to sufficiently high values of both CONF(C_1, C_2) and
CONF(C_2, C_1), then their respective nodes can be contracted. One of
the two nodes is retained and the other is eliminated (the retained
cluster node can be chosen in a variety of ways, such as the one with
higher RSCONF value, or the one with larger size). Any edges involving
the eliminated node are then assigned to the retained node; for
example, if C_1 is retained and C_2 is eliminated, then the edge
(C_2, C_3) is converted to (C_1, C_3). Any duplicate edges that result
would also be eliminated. Of course, other simplification methods may
be adopted in the present invention.

<Graphical User Interface: Cluster Labeling>

In order to provide a useful graphical user interface for displaying
searched clusters, the problem of query cluster labeling and
identification will now be considered in the context of textual data
and vector space modeling. Since query clusters lie within a restricted
neighborhood of a single query element, it is tempting to use the query
as a descriptor of the cluster, much as in representative-based
clustering. However, the query element may not necessarily be the best
representative for its cluster; indeed, it may be the case that no
individual element of the cluster adequately describes the whole.

One common way of assigning labels to a cluster is to use a ranked list
of terms that occur most frequently within the documents of the
cluster, in accordance with the term weighting strategy used in the
document vector model. Each term can be given a score equal to the sum
(or equivalently the average) of the corresponding term weights over
all document vectors of the cluster; a predetermined number of terms
achieving the highest scores can be ranked and presented to the user.

If dimensional reduction techniques such as COV or LSI are being used,
the original unreduced document vectors may no longer be available, or
may be expensive to store and retrieve. Nevertheless, meaningful term
lists can still be extracted even without the original vectors. Note
first that the i-th term can be associated with a unit vector
z_i = (z_{i,1}, z_{i,2}, ..., z_{i,N}) in the original document space,
such that z_{i,j} = 1 if i = j, and z_{i,j} = 0 otherwise. Now, let μ
be the average of the document vectors belonging to the query cluster
NN(R, q, k). Using this notation, the score for the i-th term can be
expressed simply as z_i · μ. However, since ||z_i|| = 1 and ||μ|| is a
constant, ranking the terms according to these scores is equivalent to
ranking them according to the measure below:

(z_i · μ) / (||z_i|| ||μ||) = cosangle(z_i, μ) = cos θ_i,

where θ_i represents the angle between vectors z_i and μ.

With dimensional reduction, the pairwise distance cosangle(v, w)
between vectors v and w of the original space is approximated by
cosangle(v', w'), where v' and w' are the respective equivalents of v
and w in the reduced dimensional space. Hence we could approximate
cosangle(z_i, μ) by cosangle(z'_i, μ'), where z'_i and μ' are the
reduced-dimensional counterparts of vectors z_i and μ, respectively.
The value cosangle(z'_i, μ') can in turn be approximated by
cosangle(z'_i, μ''), where μ'' is the average of the
reduced-dimensional vectors of the query cluster. Provided that the
vectors z'_i have been precomputed for all 1 ≤ i ≤ d, a ranked set of
terms can be efficiently generated by means of a nearest-neighbor
search based on μ'' over the collection of reduced-dimensional
attribute vectors. As d is typically quite small, the cost of such a
search is negligible compared to the cost of generating the cluster
itself.

The reduced-dimensional cluster labeling method can be summarized as fol 
lows: 

i) For all 1 ≤ i ≤ N, precompute the reduced-dimensional attribute
vector z_i = (z_{i,1}, z_{i,2}, ..., z_{i,d}) for the i-th attribute.
Let W be the set of reduced-dimensional attribute vectors.

ii) Compute μ'' = Σ_{v ∈ NN(R, q, k)} v, where v and q are taken to be
reduced-dimensional data vectors.

iii) If λ is the desired number of labels for the cluster, compute the
λ-nearest-neighbors of μ'' in W, according to decreasing values of the
cosangle measure.

iv) Report the attributes corresponding to the ranked list of λ
neighbors as the cluster labels. Optionally, the values of cosangle
themselves can be displayed to the user. Also optionally, approximate
nearest neighbors can be used as produced using a SASH or other
similarity search method.
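
The labeling steps summarized above may be sketched in Python as
follows. The sketch ranks all attribute vectors by brute force instead
of using a SASH, and the attribute names, the two-dimensional vectors
and the function names are hypothetical values chosen only to make the
example self-contained.

```python
import math


def cosangle(u, v):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def label_cluster(cluster_vectors, attribute_vectors, num_labels):
    """Return the num_labels attributes closest in angle to the cluster sum.

    cluster_vectors: reduced-dimensional vectors of the query cluster.
    attribute_vectors: {attribute name: reduced-dimensional attribute vector}.
    """
    d = len(cluster_vectors[0])
    # Step ii): sum of the cluster vectors (same direction as their average).
    mu = [sum(vec[j] for vec in cluster_vectors) for j in range(d)]
    # Steps iii) and iv): rank attributes by decreasing cosangle.
    scored = [(name, cosangle(z, mu)) for name, z in attribute_vectors.items()]
    scored.sort(key=lambda s: -s[1])
    return scored[:num_labels]


cluster = [[1.0, 0.0], [0.8, 0.2]]
attributes = {"canyon": [1.0, 0.0], "landfill": [0.0, 1.0], "park": [1.0, 1.0]}
labels = label_cluster(cluster, attributes, num_labels=2)
print(labels)
```

Since the cosangle measure does not depend on vector length, using the
sum of the cluster vectors gives the same ranking as using their
average.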

Part II. A System for Information Retrieval

Fig. 10 shows a system in which the algorithm of the present invention
is implemented. As shown in Fig. 10, the system generally comprises a
computer 10, a display device 12, and input devices such as a keyboard
14 and a pointer device such as a mouse 16, such that a user may input
a query for information retrieval according to the present invention.
The computer 10 also manages a database 18 for storing documents to be
searched. The computer may add new documents to the database 18 and
retrieve stored documents therefrom. The computer 10 may be connected
to communication lines 20 such as a LAN, a WAN, or the Internet, using
Ethernet (Trade Mark), optical communication, or ADSL with suitable
communication protocols, through a hub or router 22.

When the communication line 20 is assumed to be a LAN/WAN and/or the
Internet locally interconnecting sites of an enterprise, the computer
10 may be a server to which queries input by clients and/or users are
transmitted to execute information retrieval. The server computer 10
retrieves documents with respect to the received query by the algorithm
of the present invention and returns the retrieved results to the
clients that issued the query. Of course, the present invention may
provide the above information retrieval through the Internet as a
charged information service to registered clients. Alternatively, the
computer 10 may be a stand-alone system suitably tuned for a particular
usage.

Fig. 11 shows detailed functional blocks implemented in the computer
10. The computer 10 generally comprises a vector generation part 24, a
SASH generation part 36, a confidence determination part 38 for
creating the SCONF list, and a patch definition part 26. The vector
generation part 24 executes vector generation using a keyword list or
predetermined rules from the documents stored in a database 18, and
stores generated document-keyword vectors in an appropriate storage
area such as a memory or a database with adequate links or references
to the corresponding documents. The SASH generation part 36 and the
patch definition part 26 constitute the neighborhood patch generation
part 34 according to the present invention.

The SASH generation part 36 constructs the SASH structure using the
algorithm shown in Fig. 2, and the generated SASH structure is stored
in the memory area 30 for the processing described hereinafter in
detail. The SASH is made available to a confidence determination part
38 to compute confidence values such as CONF, SCONF, and RSCONF so as
to generate a SCONF list according to the above-described algorithm.
The generated patch data and the confidence values are stored in a hard
disk 32, as shown in Fig. 7.

The query vector generation part 46 accepts search conditions and query
keywords, creates a corresponding query vector, and stores the
generated query vector in an adequate memory area. The query may be of
two types: one is to extract cluster structures already computed and
stored in the database 32, and the other is to retrieve cluster
structures that may not yet have been computed and stored. The user
input query vector is first transmitted to a retrieval part 40. In the
described embodiment, the retrieval part analyzes the query type. If
the query instructs the retrieval part 40 to retrieve cluster
structures already computed and stored, the query is performed on the
SASH structure stored in the memory area 30, and the queried patch
generation part 44 transmits the retrieved data to the cluster
estimation part 28. The cluster estimation part 28 invokes the patch
data and associated SCONF list from the hard disk 32 upon receiving the
retrieved data, and performs cluster estimation using intra-cluster
confidences SCONF and RSCONF, and inter-cluster confidences CONF,
respectively. The node used in the queried patch generation part 44 may
be an arbitrarily selected node or a node retrieved by the user input
query.

The derived cluster data are transmitted to a GUI data generation part
42 to construct data for graphically presenting the cluster graph
structure on a display screen of a display part (not shown). Many
display embodiments of the cluster graph structure are possible in the
present invention. One representative embodiment is to align the
clusters horizontally along with significant keywords (for example,
those with the largest numerical values) included in the clusters while
aligning the clusters vertically with estimated cluster size. When such
a display is provided on the display screen, the GUI data generation
part 42 may sort the cluster data from the patch repository 32, and
store the sorted data in an adequate memory area therein such as a
display buffer (not shown), or elsewhere in the computer 10.

In a specific embodiment of the present invention, when the retrieval
part 40 determines that the query instructs the retrieval of clusters
that have not already been computed and stored, the retrieval part 40
invokes the SASH data 30, and retrieves the appropriate node vectors of
the SASH by computing similarities between the document-keyword vectors
and the query vectors. The retrieved data vectors are then themselves
used as queries within the SASH data 30, to obtain a list of similar
node vectors for every vector retrieved by the original query. Each
list of node vectors is sent to the patch definition part 26 and thence
to the confidence determination part 38 to produce patches, which may
then be added to the patch repository 32. The retrieved patches are
then transmitted to the cluster estimation part 28 together with their
corresponding SCONF lists to estimate the cluster comprising nodes
retrieved in the original query, and the computed cluster data are
transmitted to the GUI data generation part 42 for graphical
presentation of the queried results.

The GUI data generation part 42 may transmit sorted cluster data to a
display device (not shown) directly connected to the computer 10 to
display the searched cluster data on a display screen. Alternatively,
when the system provides the searched results via the Internet using
browser software, the GUI data generation part 42 generates graphical
data of the interrelation of the clusters in a format suitable to the
browser software, such as an HTML format.

Part III. Practical Scenarios for Executing Invention 
<Scenario A - Total Clustering of Nodes in Database> 

Fig. 12 shows a flowchart for Scenario A for executing total clustering
of the nodes stored in the database. The algorithm of Scenario A first
loads the document and the keyword data in the step S40 and proceeds to
the step S42 to generate document-keyword vectors and keyword lists.
The algorithm executes in the step S44 dimension reduction using LSI or
COV as described before. Then the process of Scenario A creates in the
step S46 a SASH of the dimension-reduced document-keyword vectors
according to the process described in Fig. 2. The data structures
generated according to the algorithm shown in Fig. 12 are shown in
Fig. 13 according to the stepwise execution of the algorithm.

Once the SASH structure has been constructed, a similarity query is
performed based on each of its elements, thereby generating one patch
for each document, as shown in Fig. 14(a). The algorithm of Scenario A
then computes in the step S48 the optimum patch sizes and RSCONF values
for each patch as shown in Fig. 14(b), and then the patches are sorted
with respect to their RSCONF values as shown in Fig. 14(c).

Again referring to Fig. 12, the algorithm of Scenario A proceeds to the
step S50 to select, at each SASH level, a collection of patches for
which all inter-patch association confidences at that level are less
than β = 0.4. Then those patches with RSCONF values larger than or
equal to α = 0.15 are further selected to determine the clusters, in
the step S52. The data structures relevant to the steps S46-S52 are
shown in Fig. 15.

Next, the algorithm of Scenario A proceeds to the step S54 to create
connections between the clusters having association confidence values
larger than or equal to the predetermined threshold γ. This data
structure is shown in Fig. 16. These connection results, together with
the cluster labels and corresponding keywords, are provided graphically
in the step S56 as a graphical representation such as that shown in
Fig. 17.

Fig. 17 shows a portion of a cluster graph produced according to
Scenario A (on an earlier run with γ = 0.2) for the case in which COV
dimensional reduction was used. In the figure, cluster nodes (shown as
ovals) are marked with a pair of numbers x / y, where x indicates the
estimated size of the cluster and y indicates the associated sample
patch size. Keyword labels are shown for each cluster; boxes have been
drawn around those connected subsets of clusters sharing identical
label sets (with perhaps minor differences in the label ordering). The
cluster corresponding to the node marked 106 / 53 is shown in Fig. 17.
This cluster is particularly interesting, as it consists of news
articles in the intersection of two larger sets of clusters, involving
canyons and their development and conservation issues on the one hand,
and garbage dumps and landfills on the other.

The detailed procedures included in the process shown in Fig. 12 are
described below:

i) Model the subset of M documents as vectors, using (for example) the
binary model or TF-IDF weighting. The dimension of these vectors is N,
the number of attributes of the data set.

ii) As a further example, apply dimensional reduction of the set of
vectors to a number significantly smaller than N (typically 200 to
300), using (for example) the COV or LSI dimensional reduction
technique. If dimensional reduction is chosen, then also generate a set
of reduced-dimensional attribute vectors.

iii) Construct the SASH structure for handling k-nearest-neighbor
queries. Set the random sample R_t = S_t ∪ S_(t+1) ∪ ... ∪ S_h, where
S_t is the t-th SASH level for 0 ≤ t ≤ h (here, S_0 is taken to be the
bottom SASH level).

iv) For all 0 ≤ t ≤ h, for each element v ∈ S_t, compute and store an
approximate m-nearest-neighbor list (m-patch) NN(R_t, v, m) for that
element, where m = φ(b).

v) Compute a set of query clusters and a cluster structure graph as
outlined in Fig. 16.

vi) When the dimension reduction is performed, for each query cluster
of the set, generate a set of attribute keywords from the
reduced-dimensional document vectors that constitute the cluster.

vii) Make the resulting set of clusters, their sizes and labels, and
cluster structure graph available to the user for browsing, using a
suitable user interface.
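
The construction of the random samples from the SASH levels in step
iii) may be sketched in Python as follows. The representation of the
SASH levels as a plain list of lists, and the function name, are
assumptions made for illustration; the construction of the SASH itself
(Fig. 2) is not reproduced here.

```python
def samples_from_sash_levels(levels):
    """Build R_t = S_t ∪ S_(t+1) ∪ ... ∪ S_h from SASH levels.

    levels[t] holds the elements of level S_t, with levels[0] the bottom
    level. The result is the list [R_0, R_1, ..., R_h].
    """
    samples = []
    accumulated = []
    for level in reversed(levels):          # start from the top level S_h
        accumulated = list(level) + accumulated
        samples.append(list(accumulated))
    samples.reverse()                       # put R_0 first
    return samples


levels = [[1, 2, 3, 4], [5, 6], [7]]        # hypothetical S_0, S_1, S_2
samples = samples_from_sash_levels(levels)
print(samples)
```

With this construction R_0 contains every element, and each further
sample is contained in the previous one.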

<Scenario B - Individual Clusters; Query Search> 

In Scenario B, the same process as in Scenario A may be used to
generate a SASH. The subsequent essential steps are shown in Fig. 18,
and the data structures generated by the process of Scenario B are
shown in Fig. 19 and Fig. 20. As shown in Fig. 18, the process of
Scenario B generates the SASH structure in the step S60, and proceeds
to the step S62 to receive a user input query q together with a target
cluster size k, and stores them in an adequate memory space. Then the
nodes in the SASH are retrieved with respect to the query using the
SASH structure in the step S64. In the step S64, the SASH is queried to
produce one neighborhood patch for the query element q with respect to
each of the random samples R_t, for all 0 ≤ t ≤ h. Then the process
continues to the step S66 to compute RSCONF and to solve the RSCM
problem with respect to the user input query q, for every random
sample. For each sample, a cluster is thereby produced. The process of
Scenario B then provides labels, that is, keywords representing these
clusters, in the step S68. The data structures obtained from the step
S64 to the step S68 are shown in Fig. 20.

The details of the procedures in Scenario B are described below:

2-i) Repeat the procedures of Scenario A from i to iii.

2-ii) Prompt the user for a query element q (not necessarily a data
element), and a target cluster size k.

2-iii) Compute t_a = max {t | k / 2^t ≥ a} and
t_b = min {t | k / 2^t ≤ b}. For all t_b ≤ t ≤ t_a, compute
NN(R_t, q, m), where m = φ(b). For all v ∈ NN(R_t, q, m), compute
NN(R_t, v, m).

2-iv) For all t_b ≤ t ≤ t_a, find solutions k(q,t) to the RSCM problems
for q with respect to R_t.

2-v) For all t_b ≤ t ≤ t_a, generate a set of attribute keywords from
the reduced-dimensional document vectors that constitute the query
cluster NN(R_t, q, k(q,t)). The procedure has been described in
Fig. 14.

2-vi) Display the resulting set of clusters, their sizes, their
corresponding m-patch SCONF profiles, and their cluster labels to the
user.
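
The choice of sample levels in step 2-iii) may be illustrated by the
following Python sketch. It assumes that t_a is the largest level at
which the scaled target size k / 2^t is still at least a, and that t_b
is the smallest level at which it is at most b; the function names and
the example values of k and h are hypothetical.

```python
def phi(k):
    """Maximum patch size, using the choice phi(k) = min {2k, 150}."""
    return min(2 * k, 150)


def level_range(k, a, b, h):
    """Return (t_b, t_a) for target cluster size k and SASH height h.

    t_a = max {t | k / 2^t >= a} and t_b = min {t | k / 2^t <= b},
    with t restricted to 0..h. Either value is None if no level qualifies.
    """
    t_a = max((t for t in range(h + 1) if k / 2 ** t >= a), default=None)
    t_b = min((t for t in range(h + 1) if k / 2 ** t <= b), default=None)
    return t_b, t_a


t_b, t_a = level_range(k=1000, a=25, b=120, h=10)
print(t_b, t_a)   # levels whose patch size range [a, b] can hold the target
print(phi(120))   # size m of the patches computed at each such level
```

For a target size of 1000 with a = 25 and b = 120, only levels 4 and 5
are queried, where the target corresponds to patch sizes of about 62
and 31 respectively.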

[Examples] 

To examine the present invention, the method of the present invention
was implemented as two scenarios as described above. Both scenarios
were examined for the publicly-available L.A. Times news document
database available as part of the TREC-9 text retrieval competition.
The database consists of M = 127,742 documents, from which N = 6590
keywords (attributes) were extracted as the attribute set. To examine
effectiveness and general applicability, the database was subjected to
two procedures, with and without the dimension reduction (under COV).
The implementation conditions were as follows:

(a) TF-IDF term weighting on 6590 attributes.

(b) COV dimensional reduction (from 6590 down to 200 dimensions) in one
set of experiments, and no dimensional reduction in another.

(c) For document nearest-neighbor searches, a SASH with default
settings (node parent capacity p = 4 and node child capacity c = 16).

(d) For attribute vector nearest-neighbor searches, a SASH for
reduced-dimensional attribute vectors with default values (node parent
capacity p = 4 and node child capacity c = 16).

For each scenario, it was assumed that parameters φ, a, b, β, and δ
were set by the system administrator, as well as any parameters
associated with dimensional reduction (such as the reduced dimension d)
or approximate similarity search.

The experimental conditions are as follows: 

(a) The choice of patch range delimiters a = 25, b = 120, and
φ(k) = min {2k, 150}.

(b) For document nearest-neighbor searches, the use of a time scaling
factor μ' = 1.25, with m' = 1.25 φ(b), influencing the accuracy of the
approximation. With every search, m' neighbors are produced, of which
the closest m are used (larger values of μ' require longer search times
but lead to more accurate results).

(c) A minimum threshold of α = 0.15 on the relative self-confidence of
clusters.

(d) A maximum threshold of β = 0.4 on the confidence between any two
clusters from a common sample.

(e) A minimum threshold of γ = 0.15 on the confidence between
associated clusters in the QCR graph (not necessarily drawn from a
common sample level).

(f) A maximum threshold of δ = 4 on the difference in scale between two
associated clusters of the QCR graph.

The computation algorithm was written using Java (JDK 1.3) and the
computation hardware was an IBM IntelliStation E Pro (Trade Mark) with
1 GHz processor speed and 512 MB main memory running the Windows 2000
(Trade Mark) operating system.

2-1. Execution time and storage costs 

Although at first glance it would seem that RSCONF values are expensive
to compute, with a careful implementation the costs can be kept
reasonably low. This is achieved through the efficient computation of a
profile of values of SCONF(NN(R, q, k)) for k ranging from 1 to φ(b).
Plots of patch profiles also provide an effective visual indication of
the varying degrees of association within the neighborhood of a query
element.
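
The role of the profile can be illustrated with a direct (non-incremental) computation. The sketch below assumes that SCONF of a k-patch is the average, over the members v of the patch, of the fraction of v's own k-patch that falls inside it, and that RSCONF at size k is SCONF at k minus SCONF at φ(k). The toy one-dimensional data, the function names, and the small cap of 8 in `phi` are illustrative assumptions.

```python
# Toy data: three groups on a line; neighbor lists are ordered by distance.
xs = [0, 1, 2, 3, 10, 11, 12, 13, 15.5, 16.5, 17.5, 18.5]
n = len(xs)
nn = {i: sorted(range(n), key=lambda j: abs(xs[i] - xs[j])) for i in range(n)}


def sconf(nn, q, k):
    """Self-confidence of the k-patch of q (brute force, O(k^2) set work)."""
    patch = set(nn[q][:k])
    shared = sum(len(patch & set(nn[v][:k])) for v in nn[q][:k])
    return shared / (k * k)


def rsconf(nn, q, k, phi):
    """Relative self-confidence: inner patch versus its outer patch."""
    return sconf(nn, q, k) - sconf(nn, q, phi(k))


def best_patch_size(nn, q, a, b, phi):
    """Patch size in [a, b] that maximizes RSCONF for query q."""
    return max(range(a, b + 1), key=lambda k: rsconf(nn, q, k, phi))


phi = lambda k: min(2 * k, 8)   # scaled-down stand-in for min{2k, 150}
k_best = best_patch_size(nn, 0, 2, 4, phi)
print(k_best, rsconf(nn, 0, k_best, phi))
```

For the first point the maximum falls at k = 4, the size of its tight group: the 4-patch is fully self-consistent, while the 8-patch mixes in elements whose own neighborhoods lie elsewhere.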

The following tables list the time and space costs associated with
Scenario A. Time was measured in terms of real seconds of computation,
beginning once reduced-dimensional document and attribute vectors had
been loaded into main memory, and ending with the computation of a full
set of clusters and their cluster structure graph. The time cost for
clustering and graph construction assumes that all nearest-neighbor
patches have already been precomputed.



Table 1

STORAGE COSTS (MB) - Reduced Dimensional Case

Document SASH Storage                       30.1
Keyword SASH Storage                         1.6
NN Patch Storage                           161.6
Reduced-Dimensional Document Storage       204.4
Reduced-Dimensional Keyword Storage          5.3
Total Storage                              403

Table 2

TIME COSTS                                   No Dim-Reduction   COV Dim-Reduction

Document SASH Build Time (s)                            460.7               898.8
Keyword SASH Build Time (s)                                 -                26.6
Total NN Precomputation Time (s)                      7,742.9            13,854.6
Clustering and Graph Construction Time (s)              126.2                81.8
Total Time (s)                                        8,329.8            14,861.8
Total Time (hr)                                           2.3                 4.1
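
The totals in Table 2 follow directly from the component times. As a quick arithmetic check (the variable names are illustrative only):

```python
# Component times in seconds, copied from Table 2.
no_reduction = [460.7, 7742.9, 126.2]          # no keyword SASH in this case
cov_reduction = [898.8, 26.6, 13854.6, 81.8]

total_plain = round(sum(no_reduction), 1)
total_cov = round(sum(cov_reduction), 1)

# Totals in seconds, then converted to hours (one decimal place).
print(total_plain, round(total_plain / 3600, 1))
print(total_cov, round(total_cov / 3600, 1))
```

In both cases the nearest-neighbor precomputation accounts for more than 90% of the total time.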



2-2. Approximate nearest neighbor computation 

The following table shows the average cost of finding approximate
m-nearest-neighbor lists from a full set of M documents, taken over 100
randomly-chosen SASH queries of size m'. For comparison purposes, exact
queries were also performed using sequential search, and the average
accuracy of the SASH queries was computed (measured as the proportion
of true nearest neighbors in the reported lists). Using these values,
one can determine the cost of producing a single query cluster directly
as per Scenario B. These latter estimates assume the use of the
document SASH without precomputed nearest-neighbor information.



Table 3

SASH Performance                             No Dim-Reduction   COV Dim-Reduction

Avg SASH Query Dist Computations                     3,039.03            2,714.57
Average SASH Query Time (ms)                            38.85               70.74
Average SASH Query Accuracy (%)                         62.93               94.27
Exact NN Query Dist Computations                      127,742             127,742
Exact NN Query Time (ms)                             1,139.19             2,732.5
Single Query Cluster Dist Comps (x10^5)                  4.59                4.07
Single Query Cluster Time (s)                            5.87               10.68
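
The derived figures of Table 3 can be related to the per-query figures with a short calculation. The count of 151 SASH queries per query cluster (one for the query element plus one for each of its φ(b) = 150 neighbors) is an assumption made here; it reproduces the tabulated single-cluster times but is not stated explicitly in the table.

```python
# Per-query times in milliseconds, copied from Table 3.
sash_ms = {"none": 38.85, "cov": 70.74}
exact_ms = {"none": 1139.19, "cov": 2732.5}

QUERIES_PER_CLUSTER = 151   # assumed: 1 query element + 150 neighbors

for case in ("none", "cov"):
    speedup = exact_ms[case] / sash_ms[case]
    cluster_s = QUERIES_PER_CLUSTER * sash_ms[case] / 1000.0
    print(case, round(speedup, 1), round(cluster_s, 2))
```

The speedup over sequential search comes to roughly 29 times without dimensional reduction and roughly 39 times with COV dimensional reduction.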



2-3. Full query clustering 

An example of a patch profile is illustrated in Fig. 21, for the case in
which COV dimensional reduction was used. The profile is associated
with a cluster produced by the Scenario A method.

The numbers of clusters produced under Scenario A with and without
dimensional reduction are listed in the table below.



Table 4

Estimated Cluster Size (low - high)   No Dim-Reduction   COV Dim-Reduction

6400 - 30720                                         1                   1
3200 - 15360                                         1                   2
1600 - 7680                                          8                   8
800 - 3840                                          15                  25
400 - 1920                                          32                  50
200 - 960                                           70                  84
100 - 480                                          206                 135
50 - 240                                           405                 216
25 - 120                                           760                 356



The dimensional-reduction variant finds fewer minor clusters than the
basic variant, but more large clusters. Experimentation also revealed
that the dimensional-reduction variant produced query cluster graphs
with richer interconnections, and was better able to resolve keyword
polysemies.



The method according to the present invention may be implemented as a
computer executable program, and the computer program according to the
present invention may be written in a language such as the C language,
the C++ language, Java (trade mark), or any other suitable programming
language (for example, another object-oriented language). The program
according to the present invention may be stored in a computer-readable
storage medium to and from which data may be written and read, such as
a floppy disk (trade mark), a magnetic tape, a hard disk, a CD-ROM, a
DVD, a magneto-optic disk or the like.
[Effect of Invention]
Improvement 1: MINOR CLUSTERS

The proposed clustering method is able to efficiently detect
well-associated and well-differentiated clusters of sizes as low as
0.05% of the database, on an ordinary computer. The methods require no
a priori assumptions concerning the number of clusters in the set. The
methods also allow clusters to be generated taking only local
influences into account. Overlapping clusters are also permitted. These
features allow minor clusters to be discovered in a way that is
impractical or even impossible for traditional methods.

Improvement 2: QUERY-BASED CLUSTERING 

The proposed clustering method can generate meaningful major and minor
clusters in the vicinity of a query efficiently, without paying the
excessive cost of computing a full clustering of the set. To the best
of the inventor's knowledge, this is the first practical method for
doing so for large text databases.

Improvement 3: AUTOMATIC DETERMINATION of CLUSTER RELATIONSHIPS

Very few clustering methods allow for the possibility of overlapping
clusters. The proposed method uses cluster overlap to establish
correspondences between clusters, and thereby produce a "cluster map"
or graph of related concepts that can be navigated by the user. Unlike
concept hierarchies, the relationships are established among groups of
data elements themselves, rather than by classifications within the
attribute space. Organization according to overlapping clusters of data
elements allows for much more flexibility in the concepts that can be
represented - in particular, minor clusters in the intersection of two
or more major clusters can be discovered using the proposed method.

Improvement 4: AUTOMATIC ASSESSMENT of CLUSTER QUALITY

RSCONF values and patch profiles not only serve to identify and compare
clusters; they are also the means by which users can assess the level
of association within a cluster, and its differentiation from the
elements in its vicinity. Patch profiles can effectively complement
existing spatial representation methods for the visualization of
higher-dimensional text clusters.

Improvement 5: DEPENDENCE on KNOWLEDGE of the DATA DISTRIBUTION

Unlike most partition-based algorithms, the proposed query-based
clustering method does not require previous knowledge or assumptions
regarding the distribution of the data - it does not matter whether the
data is uniformly distributed or has great variations in distribution.
This applies even as regards the generation of nearest-neighbor lists,
in that the SASH also has this feature.

Improvement 6: SCALABILITY 

When a SASH structure is used for approximate similarity queries, the
asymptotic time required by PatchCluster for a total clustering of data
set S is in $O(|S| \log_2 |S| + c^2)$, where c is the number of
clusters produced (typically much smaller than |S|). The former term
covers the cost of producing profiles and ranking candidate query
clusters according to their RSCONF values. The elimination of duplicate
clusters and the generation of graph edges can all be performed in
$O(|S| + c \log_2 |S| + c^2)$ time.

The bottleneck in the construction of a query cluster graph lies in the
precomputation of nearest-neighbor patches. However, the clustering
method does not require perfectly-accurate nearest-neighbor lists in
order to detect approximate cluster boundaries and overlaps. It is far
more cost effective to use one of the emerging techniques, such as the
SASH, for fast generation of approximately-correct nearest-neighbor
lists instead. For the L.A. Times news article data set using COV
dimensional reduction, the SASH offers speedups of roughly 40 times
over sequential search at almost 95% accuracy. The asymptotic
complexity of precomputing patches is dominated by the total cost of
the SASH operations, which is in $O(|S| \log_2 |S|)$.

Hereinabove, the present invention has been explained using particular
embodiments depicted in the drawings. Of course, it is appreciated by a
person skilled in the art that many alternative embodiments,
modifications, and/or additions to the disclosed embodiments may be
possible and therefore, the true scope of the present invention should
be determined in accordance with the claims herewith.

[Brief Explanation of Drawings] 

[Fig. 1] A flowchart of the method for constructing data structures
according to the present invention.

[Fig. 2] A simplified flowchart of the process for constructing the
SASH structure.

[Fig. 3] A schematic construction of the SASH with patch structures.

[Fig. 4] A sample diagram of the patches according to the present
invention.

[Fig. 5] A representative example of the computation of the confidence
function CONF.

[Fig. 6] A sample pseudo-code listing for the computation of SCONFL.

[Fig. 7] An illustration of the structure of patch and self-confidence
storage.

[Fig. 8] A sample pseudo-code listing for the refinement of patch
profiles.

[Fig. 9] A sample pseudo-code listing for PatchCluster (including patch
ranking and selection).

[Fig. 10] A schematic block diagram of a computer system typically used
in the present invention.

[Fig. 11] A schematic function block diagram of the computer system
according to the present invention.

[Fig. 12] A flowchart of the process for generating the clusters and
their interrelationship graph (Scenario A).

[Fig. 13] A graphical representation of the data structures relevant to
the process of Scenario A shown in Fig. 12.

[Fig. 14] A graphical representation of the data structures relevant to
the process of Scenario A shown in Fig. 12.

[Fig. 15] A graphical representation of the data structures relevant to
the process of Scenario A shown in Fig. 12.

[Fig. 16] A graphical representation of the cluster interrelationship
graph.

[Fig. 17] A sample graphical presentation of the interrelationship
structure of clusters.

[Fig. 18] A flowchart of the process for generating clusters based at a
single query element (Scenario B).

[Fig. 19] A graphical representation of the data structures relevant to
the process of Scenario B shown in Fig. 18.

[Fig. 20] A graphical representation of the data structures relevant to
the process of Scenario B shown in Fig. 18.

[Fig. 21] A plot of a profile of SCONF values versus estimated cluster
size.

[Explanation of numerals]

[Fig. 1]

Generation of document vectors

Construction of SASH structure for document vectors

Computation and storage of neighborhood patches using the SASH
structure

Determination of sizes of cluster candidates for every patch based on
relative self-confidence (RSCONF) values

Elimination of redundant cluster candidates using inter-cluster
confidence (CONF) values

Cluster selection based on minimum relative self-confidence (RSCONF)
values

Display of interrelationships of clusters in GUI

[Fig. 2]

[Fig. 3]

[Fig. 4]

[Fig. 5]

Patch C_i

Patch C_j

CONF(C_i, C_j) = 2/8 = 25%

CONF(C_j, C_i) = 2/10 = 20%
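
The two values in Fig. 5 are consistent with CONF(A, B) being the fraction of patch A that also lies in patch B. A minimal sketch under that assumption follows; the concrete element identifiers are invented so that the patches have 8 and 10 members with 2 in common.

```python
def conf(a, b):
    """Confidence of patch a with respect to patch b: |a & b| / |a|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a)


patch_i = set(range(0, 8))     # 8 elements: 0..7
patch_j = set(range(6, 16))    # 10 elements: 6..15, sharing 6 and 7

print(conf(patch_i, patch_j))  # 2/8
print(conf(patch_j, patch_i))  # 2/10
```

The measure is not symmetric: the same overlap weighs more heavily for the smaller patch.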

[Fig. 6]

Profile(query q; maximum patch size m): SCONF list SCONFL
{Let QNL be the m-patch precomputed for query q.}
{Let NNL be a list of the m-patches precomputed for every element of QNL;
 NNL[j,i] is the i-th neighbor of QNL[j].}
{Initially, v.count = ∅ (unset) is assumed for every element v in the data set.}

1. score ← 0;

{Initially, no query neighbors are in the current patch.}
for i = 1 to m do
    2. QNL[i].count ← 0;
end for

for i = 1 to m do
    {Retrieve the number of times QNL[i] has been encountered as an
     external neighbor so far.}
    3. score ← score + QNL[i].count;

    {Indicate that henceforth QNL[i] is in the current i-patch.}
    4. QNL[i].count ← present;

    for j = 1 to i - 1 do
        5. w ← NNL[j,i];
        if w.count = present then
            6. score ← score + 1;
        else if w.count ≥ 0 then
            7. w.count ← w.count + 1;
        end if

        8. w ← NNL[i,j];
        if w.count = present then
            9. score ← score + 1;
        else if w.count ≥ 0 then
            10. w.count ← w.count + 1;
        end if
    end for

    11. w ← NNL[i,i];
    if w.count = present then
        12. score ← score + 1;
    else if w.count ≥ 0 then
        13. w.count ← w.count + 1;
    end if

    14. SCONFL[i] ← score / i^2;
end for

{Reset the counts to their default value.}
for i = 1 to m do
    15. QNL[i].count ← ∅;
end for
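
The incremental scheme of Fig. 6 can be sketched in Python and checked against a brute-force evaluation. This is an illustrative reading of the listing, not the original Java code: counters are kept in a dictionary holding only the query neighbors, the sentinel `PRESENT` marks patch members, and the toy data and function names are assumptions.

```python
PRESENT = -1  # sentinel: element already belongs to the current patch


def sconf_profile(nn, q, m):
    """Return [SCONF of the 1-patch, ..., SCONF of the m-patch] of q."""
    qnl = nn[q][:m]
    count = {v: 0 for v in qnl}      # only query neighbors carry a counter
    score, profile = 0, []
    for i in range(m):               # 0-based counterpart of i = 1..m
        v = qnl[i]
        score += count[v]            # earlier sightings of v now count
        count[v] = PRESENT
        # Cells added when the i x i neighbor matrix grows by one:
        # new column, new row, and the corner cell.
        cells = [nn[qnl[j]][i] for j in range(i)]
        cells += [nn[v][j] for j in range(i)]
        cells.append(nn[v][i])
        for w in cells:
            c = count.get(w)         # None for elements outside QNL
            if c == PRESENT:
                score += 1
            elif c is not None:
                count[w] = c + 1
        profile.append(score / (i + 1) ** 2)
    return profile


def sconf_brute(nn, q, k):
    """Direct evaluation, used only to check the incremental version."""
    patch = set(nn[q][:k])
    return sum(len(patch & set(nn[v][:k])) for v in nn[q][:k]) / (k * k)


xs = [0, 1, 2, 3, 10, 11, 12, 13, 15.5, 16.5, 17.5, 18.5]
nn = {i: sorted(range(len(xs)), key=lambda j: abs(xs[i] - xs[j]))
      for i in range(len(xs))}
profile = sconf_profile(nn, 0, 8)
print(profile)
```

Each step inspects only the 2i - 1 new cells of the neighbor matrix, so the whole profile costs on the order of m^2 lookups rather than one such computation per patch size.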

[Fig. 7]

[Fig. 8]

RefineProfile(query q;
              inner patch size k_i;
              outer patch size k_o): reordered query k_i-patch RQNL
{Let QNL be the k_o-patch precomputed for query q.}
{Let NNL be a list of the k_o-patches precomputed for every element of QNL.}
{Initially, v.inpatch = false is assumed for every element v in the data set.}

{Identify the inner patch members.}
for i = 1 to k_i do
    1. QNL[i].inpatch ← true;
end for

{Initialize the confidence value CONFL of every patch element to zero.}
for i = 1 to k_o do
    2. CONFL[i] ← 0;
end for

{For each element of the outer patch, count the number of elements
 of their k_i-nearest-neighbor sets shared with that of q.}
for i = 1 to k_o do
    for j = 1 to k_i do
        3. w ← NNL[i,j];
        if w.inpatch = true then
            4. CONFL[i] ← CONFL[i] + 1;
        end if
    end for
    5. CONFL[i] ← CONFL[i] / k_o;
end for

{Reorder the outer patch elements according to their confidence values,
 from highest to lowest.}
6. RQNL ← sort(QNL, CONFL, k_o);

{Reset the patch membership indicators to their default values.}
for i = 1 to k_i do
    7. QNL[i].inpatch ← false;
end for
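
The reordering performed by RefineProfile can be sketched as follows. The neighbor lists are hand-written for illustration, the function and variable names are assumptions, and a stable sort is used so that elements with equal confidence keep their original distance order (the listing does not specify tie handling).

```python
def refine_profile(nn, q, k_inner, k_outer):
    """Reorder the outer patch of q by overlap with the inner patch."""
    outer = nn[q][:k_outer]
    inner = set(outer[:k_inner])
    conf = []
    for v in outer:
        # Count members of v's k_inner-patch that lie in q's inner patch.
        shared = sum(1 for w in nn[v][:k_inner] if w in inner)
        conf.append(shared / k_outer)
    order = sorted(range(k_outer), key=lambda i: -conf[i])  # stable
    return [outer[i] for i in order]


# Hypothetical neighbor lists; only outer-patch elements need an entry.
nn = {
    "q": ["q", "a", "x", "b", "c"],
    "a": ["a", "q", "b", "c", "x"],
    "x": ["x", "y", "z", "q", "a"],
    "b": ["b", "a", "q", "c", "x"],
    "c": ["c", "b", "a", "q", "x"],
}
reordered = refine_profile(nn, "q", 3, 5)
print(reordered)
```

Here element "x" is the third-closest neighbor of "q" by distance but shares little of its own neighborhood with the inner patch, so it is moved behind "b".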

[Fig. 9]

PatchCluster(data set S;
             RSCM parameters a, b, m = φ(b);
             thresholds α, β, γ, δ): query cluster graph G

1. Randomly partition the set S into subsets S_t of approximate size
   |S| / 2^(t+1), for 0 ≤ t < h = ⌊log_2 |S|⌋.

2. For all 0 ≤ t < h do:

   (a) For every element v ∈ S_t, compute nearest-neighbor patches
       NN(R_t, v, m), where R_t = ∪_{i ≥ t} S_i.

   (b) For each element v_{t,i} ∈ S_t, compute the optimal query cluster
       size k(v_{t,i}) maximizing RSCONF(NN(R_t, v_{t,i}, k), φ), for
       values of k between a and b.

       The ranked collection of patches

       C_t = ( C_{t,i} | i < j ⟹ RSCONF(C_{t,i}, φ) ≥ RSCONF(C_{t,j}, φ) )

       forms the candidates for the query clusters associated with
       sample R_t ⊆ S, where C_{t,i} = NN(R_t, v_{t,i}, k(v_{t,i})).

   (c) Let Q_t be a list of patches of C_t that have been confirmed as
       query clusters of R_t. Initially, Q_t is empty.

   (d) For all 1 ≤ i ≤ |C_t| do:

       i.  If RSCONF(C_{t,i}, φ) < α, then break from the loop.

       ii. For all w ∈ C_{t,i} do: if NN(R_t, w, k) ∉ Q_t for any value
           of k, or failing that, if
           max{CONF(NN(R_t, w, k), C_{t,i}), CONF(C_{t,i}, NN(R_t, w, k))} < β,
           then add C_{t,i} to the list Q_t.

3. Let h' be the largest index for which |Q_{h'}| > 0. Let {C_{t,j}} be
   the set of patches comprising Q_t, where
   C_{t,j} = NN(R_t, q_{t,j}, k(q_{t,j})), for all 0 ≤ t ≤ h'. Initialize
   the node set of the query cluster graph G to contain these patches,
   one patch per node.

4. For all 0 ≤ t ≤ h', all 1 ≤ j ≤ |Q_t|, and all max{0, t - δ} ≤ s ≤ t, do:

   (a) Compute C'_{t,j} = NN(R_s, q_{t,j}, 2^(t-s) k(q_{t,j})).

   (b) For all 1 ≤ i ≤ |Q_s|, if C_{s,i} ≠ C'_{t,j} and
       max{CONF(C_{s,i}, C'_{t,j}), CONF(C'_{t,j}, C_{s,i})} ≥ γ, then
       introduce the edges (C_{s,i}, C_{t,j}) and (C_{t,j}, C_{s,i}) into
       G, with weights CONF(C_{s,i}, C'_{t,j}) and CONF(C'_{t,j}, C_{s,i}),
       respectively.
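
The selection loop of step 2(d) can be sketched for a single sample level. This is a simplified reading, offered under stated assumptions: each candidate is given as a (center, patch, RSCONF) triple with its size already chosen, a candidate is compared only against accepted clusters whose centers lie inside it, and all names and data are hypothetical.

```python
def conf(a, b):
    """Fraction of patch a that also lies in patch b."""
    return len(a & b) / len(a)


def select_clusters(candidates, alpha, beta):
    """Greedy selection of query clusters from ranked candidate patches."""
    accepted = {}  # center -> patch, in order of acceptance
    for center, patch, score in sorted(candidates, key=lambda c: -c[2]):
        if score < alpha:
            break  # remaining candidates rank even lower
        # Accepted clusters centered at a member of this candidate.
        rivals = [accepted[w] for w in patch if w in accepted]
        if all(max(conf(r, patch), conf(patch, r)) < beta for r in rivals):
            accepted[center] = patch
    return accepted


candidates = [
    ("a", {"a", "b", "c", "d"}, 0.60),
    ("b", {"a", "b", "c", "e"}, 0.50),   # near-duplicate of the first
    ("x", {"x", "y", "z", "w"}, 0.30),
    ("y", {"y", "p", "r", "s"}, 0.10),   # below the alpha threshold
]
clusters = select_clusters(candidates, alpha=0.15, beta=0.4)
print(list(clusters))
```

With α = 0.15 and β = 0.4, the second candidate is rejected as a duplicate of the first (confidence 3/4), and the last is discarded for low relative self-confidence.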

[Fig. 11]

[Fig. 12]

[Fig. 13]

[Fig. 14]

(Figure content not recoverable from the extracted text: panels (a),
(b) and (c); panels (b) and (c) list RSCONF values for the patches.)

[Fig. 15]

[Fig. 16]

[Fig. 17]

[Fig. 18]

[Fig. 19]

[Fig. 20]

[Fig. 21]

[Abstract]

[Objective] To provide a computer system for generating a data
structure for information retrieval, a method thereof, a computer
executable program for generating a data structure for information
retrieval, a computer readable medium storing the program for
generating a data structure for information retrieval, an information
retrieval system, and a graphical user interface system.

[Means to Solve Problem] A computer system for generating data
structures for information retrieval of documents stored in a database,
the computer system comprising: a neighborhood patch generation part 34
for defining patches of nodes having predetermined similarities in a
hierarchy structure. The neighborhood patch generation part 34
comprises a hierarchy generation part 36 for generating a hierarchy
structure upon the document-keyword vectors and a patch definition part
26. The computer system also comprises a cluster estimation part 28 for
generating cluster data of the document-keyword vectors using the
similarities of patches.

[Selected Drawings] Fig. 11
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& L fcRm & 2 o tf>/\- v * S ^£ if e> 2^ £ fl»it 5 IC&M * £ # 

fftt©U^;i/0f?:WWi.t5fe«)iCtMt5. 
[0 0 5 2] 

HT\ C q = NN(R, q, k)fc, A^yf lit', 
A^^C q $r^-r^7\°^^tt % vec q (Z)i-^T©l^*fCOV\TC v = NN(R, v, k) 

*-£tCtt, C q £, ^.©#^^^y^i:(Z)ra^C^V^Mtt$:5 ! D^^c^^g!I•r^r £#T* 
[0 0 5 3] 

SC0NF(C q ) = (1 / |C q |) * 2 V G Cqf | Cv | = | Cq | C0NF(C q ,C v ) 
= (1 / k 2 ) * S v G Cq |NN(R, q, k) f| NN(R, v, k) | . 

fe SIill:^, 3 *i W 0 (C 55 -5 < IZ o ti T rtfflS BtfPtt &/Jn $ < & £ 
[0 0 5 4] 

nr% *f*^ur^sy- Kq*«, s*#fNiiir&#M , r5Ri*©&s#^x*-- 

-fextt. i^fflfe^iffla^k^bt^(D7 77,^ - <IB&-f Sq&TctC L 
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J:»J*g<45i:, * 9* ©SIR Jfc«Stti£^1z;i/7- 

M,5:§t5. t^fX^k0^7fC qik = NN(R, q, k)tt, mffirty?-®?? 
^^-&^U-*>fXA«#(k)>k0^^^C q> 0 (k) = NN(R, q, 0 (k))'tt, 

[0 0 5 5] 

(R, v, ^©s^T^^fecDii-r-Si:, r:t\ (v,«m, ftM^6©*fT*&£ 

o 

[0 0 5 6] 

<Z)jg^S:«ftfSii:tc«eS 0 w#££, rt/^^©S*T?&*«-£tcte, (v,»m 
[0 0 5 7] 

[0 0 5 8] 

i) ©«&0>i«v%SI#W:, kw*y^flT?©|SI#1£<Z>iftv*l/'<;i/&^a^ i 
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3l-^T©k{C;fo£*K 2o<DMltrSC0NF(C q>k )££a ? l - SC0NF(C q> # (k ))©fD 
[0 0 5 9] 

*B*f -fc ;i/ y-nyyj^y* (RSCM) © ffl£ tt, fjttS <fc o IZ L X fe#,it "t" -5 

[0 0 6 0] 
max a ^ k ^ b RSCONF(C q k , 0), 

[0 0 6 1 ] 

RSCONF(C q>k , 0) = SCONF(C q>k ) - SCONF(C q> ^ (k) ) 
= SCONF(NN(R, q, k)) - SCONF(NN(R, q, <p (k))) 

Tf&y, RSCONF&, k-7ly^C q>k ©Rfe<fctJf#K:M^Sffi*f-fe^7-3>"7-f -5 s 

^^tbT^iStiS'. ft*&#;t5k-/ty^te, £<Z>MUc;b£:5q©? x y 
- • LX^m^tl^^(Dil^^>o RSCMtt, S^M^fe U (maximum l 

ikelihood estimation) ©^i: LmSifc^Tft, itltt, a£#J&©*f# 

[0 0 6 2] 

H6S, ^xu -sus©/*^ • -7u7y-( isX^mmcD^mz^ 

N(R, q, 0(b)) and NN(R, v, 0 (b)) j^-T^tC, ^TCDvlCo ^Tfljffl ^tlT* 
ifeStiC^LT^LfeHtftS. SC0NF(NN(R, q, k)) &E8WlC3lffl-t Stf) 
T?ttfc<, SC0NF(NN(R, q, k))&; l o(B7>f t" AiC «fc y A y * U Mt> 

^{Cf#^^<|f#$:^fe>3^{Cj:oT, SC0NF(NN(R, q, k-l))^e>»e>tl5. 
[0 0 6 3] 
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A'7f -t^fXit 4> (k) =min {2k, m} t. L, ' V J XHfa^v ^ - V J 

[0 0 6 4] 

[0 0 6 5] 

RSCONF' (C q>k ,*) = w 1 SC0NF(C q)k ) - » 2 SC0NF(C q>(#) (k) ) , 

(Dmm*Wtj^t-t &fS*> y c fi^tf # o < Wl £ J: o <» 2 ©iiUiKts 

[0 0 6 6] 

£©SPg-e, «3»Stlfe-e*l-6tl©A^^C = NN(R, q, m)li, l^k^mtf) 
fgH©k©-r^T©iaiCMLT, C q n ©^tl J ?n©i[f^y^C q)k =NN(R, q, k)-£ 
tl-^tllC^f -5, t^7-3>7-ff>XfOD'JX h$:#oTVNS„ SC0NF£. L 

IBtCJ:y#flgS*l«. 
[0 0 6 7] 

*nm<ni*ht£z>mo) i $mm\%> tt»©3^h»t5^©t^s 0 rsconf 
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o 

[0 0 6 8] • 

#^1"^>o NN(S, q, c)£R£ A°y^NN(R, q, k)££/&i-.g> 0 »~ 
k = |NN(S, q, c) n R|T*&£ 0 Ay^NN(S, q, k) ft, RtCfctfSq©? X 
U- • LTCD-9->^;i/R-NN(S, q, k)0>3g^lifKC|S5g LT, • NN(S,. q 

-, c)tf):/n3f ^fcig^U ititt, ±ib<D?-9 • -fey hJCWLTqtCfcfUT© 
^1 U - • LT<Z)NN(S, q, c)(Bfc«3©ffi«fci- -5 

o 

[0 0 6 9] 

o ^tiJ^^Mcte, k& % a£b£<3DWJctiJ&<. A snaths ifctfTfgfcv* 

y^0> oT|g'(£(D|5S#li, MotwanifeJ:tFRaghavan(R. Motwani and P. Raghavan, 
Randomized Algorithms, Cambridge University Press, New York, USA, 1995.) 
<Dmm W &Chernof f m&&ffi lt#5ifc^T?t5. 

E[k] = ti = c | R | / |S| 

Pr [k < a | c] g e"^ [eji / (a-1)] 3-1 

Pr [k > b | c] ^ e"^ [en / (b+l)] b+1 . 

5t^(c-DVNT, / >^<ii ; b-offi-9->^i/l:, < £ % l o©^n apt/ • Ay 

# a t b £ © <d -b- x ^ & £ j; e> tc -r & 3 ^ # -e ^ -6 . 

[0 0 7 0] 

m&m&nmmtiLTK rnqm^yfi* - v-y7;vT*&&\K i \ = isi / 2* f 
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or i ^ 0-e&£{R 0 , R v R 2 , . . .} ICO \^X^&? 5„ NN(R p q, kj) 

NN(S, q, c)®^n^r^-;l7f ttS. 331% < £ %»25JgiLhT? 

&SCl£#«3E;**lfc*»©«T-&<6. a=25,b=l20©«(IKS:aW , r-S»^tCtt, 
4>&< -D<D-V-y7m i K.MlsT, mBCD7u*i/ • Ay ^©mHI-SU- 

» i A«i<0|SBBtC«!llSS*l'5»^tCtt, NN(R| , q, k.)#RSCM&lCj: UfNiT' 
(Sv^jiStfc&tis (0. 004285 <fc U/Jn$ 0 
[0 0 7 1 ] 

*ofc<tj£3fehra«©t>©1?S)»J, ^flk-t*5l^t±, HRWtC TteV^fc 
[0 0 7 2] 

RSCM&#^n3f>> • A 0 ^^NN(R p q, k.) LTHMS 

oTtu SK&V&9?x*-iztt1fc?ZVJ X*Mfe~rz>fcisb<DiEm 

fc^SttfcV*. OfrOfctffc, E[k] = k-T'GDc 

= E[k] |S| / |Rj|©Htt, MW?**- ' f>fX©I^M=fey 

, Ufc^ot, (a |S|) / IRjI^&So 
[0 0 7 3] 

© # 9 * - < o * -9- > :7;wc ;b o r $ *ia \% 

. nwzxz-w^ xo&m&vi*. ^mx^^ztizmM^tir=.^ 0 iz* 

£*i&v^li?-e 9^XZ-<D&&*B?mffi&<D2bZi%n£.%iZ>Z\lii$ 
[0 0 7 4] 

#38i|i!Kfcv*Ttt3 e>tc, #ii<D#xy -Mi?^©* >A-©ifi^©f(j^{c j: 



3 0 
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[0 0 7 5] 

* ^ * # -ft^iplC L o A y ^ ©Sag SrHM-T S 3 £ #$frJH#JT' & Z> 
[0 0 7 6] 

tc ^ L ^ n -ir ^ tt T tf> «fc e> I c L T H fr $ ti § . 

, j£*S*flI>*y*'(BIS3R£*xy -mmiZ3S VNT^S *l£k-;a0|,£<Z)8tJC L 
fe*?otS3fiS:fT^. Jlttlfttctt, -T ^TCDv <E NN(R, q, *(k))£, n>7 
^^^ItcoNFCCq, c v )fcLt^ot, MMfrbMci&^t^yttfVZftotifr 

b<D&MlZf&CTmffilZffi<onZ>ZLillZteZ>) „ C q = NN(R, q, k)T'& 

U, C v = NN(R, v, k)T**>-5 0 

i i) m*jM^fre>m<z>mmz* mr^mm^tit^ - ^tti^t 

^ 7 X ^ - ©|?i©|(Ji & Jilt ^ < £ tfeT'^Sc ro^&tc^ 
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[0 0 7 7] 

• v>zfji{R Q , r v r 2 , ...}©JKii**e>f§e>ti<5*xy- ■ z^^z-m 

SM^(query cluster relationship:QCR) CD ^ T7 &fll£5"^ &„ ££T», f^T 
©j<iJ3«ktf |Ril = ceil (ISl/2 1 ) for 0 ^ i < log 2 |S|ICoV*T, Rj C Rj 

i i) {Z^T.Z-m ftfcH-©t^X©2o®^7X^-©n©3>7'ff 
i i i) (M^M) l®#^^^^-©ra©3>7^f ^ >^©#/J^©L/^V^'jiir 

i v) (lt#x>?r-;i/) 2^<Dmft-tz>V5**-(Dm<Dx>T-Ji><Dm<z>Mj:<D 

L£vMB5 ("T ^|i-j|T*&U, R^Rjilte, 9 9 V 

y7;vT&z B ) 

H9li, #5§HJ!&cGgM£4x £ Patchc luster^ L fc^Kn - K0>if yzffr 

[0 0 7 8] 
1 . QCR7 - K'tfh: 

-€■4x^*10 ^ t < iog 2 islicov^, vy7w t <Dm&fr*>. R t tf>M&£?x 

V -^0iC»^ < j etl J € ? tlC i = NN(R t , q t , k.)ffeot, a^ |C, | ^b<7) * ^ * 
^-)d^^5^x'J-- *^**-QC t = {Cj, C 2 , C| Rt j} $riR^-T-2)o RS 

C0NFlCbfc#oTdgJBT*#5*Xl) - • * 9** BftB*IS»JQC t <Z) 
S3R&ig#li-<5. ^<Z>#-£\ i<jT*&;TW£> RSCONF(Cj) ^RSCONF(Cj) £ U Z.<D 
fcft©*#i:lt, TlH©2o©^$:igffl1-2> 0 

i. -T^Ttf)l^i<j^ni t &C0VNT> max {C0NF(Cj ,Cj) ,C0NF(C 

j,Ci)}</8.fcl, ; 
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ii. -r^T<£>l^i^|R t ltCo^T, RSCONF(Cj) ^ a £ f 3 

o 

[0 0 7 9] 
2 . QCRx >r^-t'yh: 
QCUc£tt5i^j^i + 8 Tr&£ J: 5 -5 # x >J - • 'C| = NN(Rj, 

q is k i )33«fctfC j = NN(Rj, q p k^cD^ti-^ftcA^y JCovVt, C* j = NN(Rj,q 
j ,2^ _l kj) i: IT, max {CONFCCpC* j),C0NF(C j,Cj)} ^ r CD^^tC. *Ba*3ft 
feXyS^Cp C j ),fe«fct>*(Cj,C i )$:, QCR^^ :7(C#A-r<5„ CONFCCpC j)fe<k 
tfC0NF(C j,Ci)CD|tS:Xy*;(Ci, Cj) , fc<fctf (Cj .CjXDM*^ W t. btlfflt 

So 

[0 0 8 0] 
[0 0 8 1 ] 

^£*l£ 0 QCR^-7&, ZL<Dfctot>?frtemffi1Ztt<D&ffife<Dt^X&-Z<£ 

ICJ: »J5t«Sn*Cl HlZ&2> B H9tt, {C£^T "PatchClusterST 

[0 0 8 2] 
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[0083]. 

PatchClusteri£(Z)S'J©^Mte> * # -(DmOfflfflZ^ti&OT'&Zo ± 

i i) u^tj^e>»e>*is^xy- • *^**-©iift#»ic#^T<BS:*:©l, 

[0 0 8 4] 
[0 0 8 5] 

PatchCluster&T'te, PatchCluster/RSCMtf) A=y * - Z it. _ti& L £#&-r& 
[0 0 8 6] 
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;i/ <D V A X #/J> 3 V n Z. £ iZ J: £ ^gfr & j£ flg ^ £ JC & tZtt * % < f -5 # & 
[0 0 8 7] 

'S <fc^Jca^$*l«<6MI««*y v * (b)=2b£ LTl^tSZ:i:^i3iWtfe5. 
U^U3ft^e>, * (b)=1.25b$:S^UT : fo, £ 
Slt^IffiStlAc. *|g0^O*^©||5feC7)^JCfeVAT«, a=25. b=120, 

k)=min{2k,150} £lT3^i:T*, ^tf)^-* • -fey h ICO ^T^M"r 

[0 0 8 8] 

PatchClusterj£fC J:U^x.<b*l£^;*# — fbtt, btf)lE5i&3g#*tC teJtgelft^ 
[0 0 8 9] 

« ^ (Z> * ^ x # - # JBE M W & & © £ to T /Jn $ & V 7. Z - /\ i: $ 4x £ ^ ^ 
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yfr^XZ* Z<Dffim> OCR?? 7 It. rL--»f Zmm-t&OiZ^^X. gfrtbX 

fc^ic si*, /hsftjfeaa LTiHRSftfcttfttf&ibfcv^ ^oHes, 

&v^&5=4£t-£^£-e\ fre>2 5 <a^tf>Mtf>-tf--f x<z>^x 

[0 0 9 0] 

(a) 9^x*-<btttt-t;\/7-zi>747 : yz>izft1r&&/\'\<DLg\ l \j&a (£ 
3fthi£S!J{c, -£ft-?fttf>-tr>:7;i/ • iyOvizM LTS/htf>*^*#-<z> 

*») = 0.l£a^0.2©ttH©teJW£Lv%. <fc »J/Jn3 <-f££, 

<Z>L£V*ter • M#£ft£<&M&&V^ o 0.15^7^ 

(c) ^-ft-fftGD^xy - • *^;**~lC$IJ§<*ft'S*-C7- K • 9 
[0 0 9 1 ] 

PatchCluster?&©£ 6 JfesaygWi, iEfit&i£0|/\° y ^ £Hffi"f <5<JDT?W:fc < 
, iSMftifiMAy^Sftlt StiCDT'*5 0 PatchC luster^ (c <fc »J HffS ft 

>5/-y^/tft3R«k yfelff^tCCKOfllSSr^-t-SfeAtC^MSftS. 
[0 0 9 2] 

$ e>^-5PatchCluster^©^M«, K^pa^ > h -3r-»7- K • h;i/£ 
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[0 0 9 3] 

£#WfcPatchClusterSSM:, M 9 IZjjk'i' £ O lC^3f X h • # (Cilfflf Z>Wi 

dfrz.* > V<D7K<D¥-1%<D*-V- K«KI £ LSI * fcliC0V£ V\ o 

[0 0 9 4] 

PatchC luster&tf) $ P, lzm<D$£mffliZ. QCR^5- 7©ffi^blC^**-S 3 h tfi 
X°g£ 0 PatchC luster^tc «fc U ^X. *l£QCR^ ?\* % &&<D9 -<Dtt 
lCOV^<Z>^#fif#£-£t? 0 L^L^A^Iffiliii, ltlT«ftt5rt 

[0 0 9 5] 

i) (UKKffiCD&m&lteX-V &Z$m~?Z>) Mill ^7 7^u<v<«T'fe5 
• y- KCj = NN(R U , q lf k^, C 2 = NN(R y , q 2 , kg), Cg = NN(R W , 
q 3 , k z )&&&mft^VP(C v C 2 ), (C 2 ,C 3 ), (C r C 3 )$:-^^t)(Z)h«^^ 0 X 
V V (C x ,C 3 ) n.— if # (C 1 ,C 2 ) 33 <fc (C 2 ,C 3 ) U TC^ £ C 2 ^ £ flcSS L 
T^$tl-5©T*rL--!f^P,|il-r- £#T*£5o 

1 i) 0»*^X 2otf)*r7*#-C 1 = NN(R U , n v k^fc^tfC 

2 = NN(R V , q 2 , k 2 )^, C0NF(C 1 ,C 2 )^3J;t>X0NF (C 2 ,C^ ©M##?g#&C]«V^ 
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NFffi feftsri:, * W:* >f * :* £ v * £ £ v * o fell * ©ft^t'I^ts z. 

yS?(C 2 , Cg)#, (C 1 ,C 3 ){C»$tl^>o -£<Z>*g*V%#fcSMOfcx»/$;T?fc 
[0 0 9 6] 

LT*xU Clfc&S***. b^L&#£>, ?xU-Hitte, 

* - <d mm & ta v * «r tus & $> a . 

[0 0 9 7] 
[0 0 9 8] 



mtiE#2 003-3037014 



#2002—368276 



. . = 1, if i = j, and z. . = 0&£<Z)J:e>lC, 5t(D \Z*c=l* y bQm&m®. 

1 9 J 1 1 J 

h;Wi= (zi ,1, zi,2, zi ,d) £ -f -5 - £: 5. nrT% nU- 

3.tf)fB*&£&MLT, iS@©ffli(Co^T©X37(t ffi#{Czj -/ttbt 
IHMt-Sifc*«*e#S. II Z| II = lT'&U, itt'ttjeiRftO-C, d 

[0 0 9 9] 

2- x ' fi/ \\ H \\ = cosangle(zj, fi) = cos 0 j, 
[0 10 0] 

(v, m)\t % rkrtfflnznfr&ffl<Dv&&&tiiz*n?n^m&. v &j;t>v 

MLTcosangle(v' ,w' )i:LTW^Ii:^T'i^ Uc^oL 

feitJfittCD^TcBiaSStlfcffl^S-etl^z' ifeJ:t»' tit, cosangl 
e(2.,^)tt, cosangle(z' n * )t«tl>3i:^t'tl), * cosangle(z 

" Ir^fflbt, cosangle(z' a" )T?jE<«"r * £ £ #T?£<5. Ml/z' 
[0101] 

IkTzMM Lfct^xZ-n^^Vytfmt. TIB ©il U T*& S . 

i) i§B<Z>T > U If a- MCoWT, -f^TOl^ i^N^C^f bt, «C7cH!Fi«T h 
U Ifa.- h • K;i/Zj = (z i}1 , z i>2 , z i)d )§:S?)^CftItIt5. 3 

ii) = s v e NN(R, q, k) v *tt3^£o 3ir.T% vfc^fctfqte, IkjxMWL 
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i i i) ^^7X^-©fItl)7^V©S[i:bT, mzlStf&ft" (DX-M 
i6j& & , coszngleKmcom £ ffi'P it^i" 3 „ 

i v) it ;iM©3S^<Z)^>*tf«-£*i£'J;* hic^-r 

£T h U t^rr- h £$t#-t£„ ZLtlt^SfJlC, cosangle-£*ieflctf>'f»£;i--9 i {c 

^fLT^t^r iii-e^s 0 $ <bizffliZiB:®L&jteM%:m*. sash, £ 

lisu©^^^^ l x $ nt=. j: e> ic^m -r & z n t> t* ^ s „ 

[0 10 2] 

h 1 1 . mm&ma>t=.&(Di/7sT2* 

0 lZ7jk£nz>& o fc, S/^-Mi^Etan^ t?n.-# 1 0 r-f^^l/^^Il 
2£, ^r^-jJ>— K 1 4 ilVv-o ftKll^r X il, 1 6 £ Vn o £rf<-f > # • 

fc©^xy-5:A*t^ii:^T'§SMi:$litv^ 0 n>tf^.-# l o te, 

IC, K^r^^ > h £^-#^-*fr<b^i?-f £. =r>t?o.-# 1 Oli 

, LAN, 2; fc&WAN££&>f N ^ofcitfg^-f > 2 0/\£, Ethernet 

[0 10 3] 

If7^f>2 0#, ^||<^^CD-9->r h$r^5^M-r-5>LAN/WANJ3i:I>V^fe 
hr-fe^tf^-T^Ui, n^tfa.-^ 1 0 ^-fy>h£J: 

$*iT&&v^ -9— A • n^tfn-* l 0 li, *»W©TA/JU XAlcj: y K3=- 
h&g^Ufc*xy'-tcW5aUT*fc$BU ^Stife^m^:, i'x'J-S: 

53i:t)T'tS 0 ^tiiilig'Jtc, n^tfj.-* l o tt, 4tJ£©flIa£lc*f LT#ai 
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Hi ltt, l 0 ftlcH^3*i£$&fg^D v 9e>ftMm"?:$>2>o n 

S/^a.-* 1 Ott, HEfe^^ h;i/^^SP2 4 SASH£j»3 6 £. SCONFU* 

wmznz^&o s<>? b;i/£*»2 4 it. K • y x hJfefetemjtcDMirj 

HffU K*a*> h-*-<7- K • ;TOir£ a * 

-*£^o£OT&fB«M^£l&Mir3o SASH£*£gP3 6 £ £t>V\°y^£D£3P 
2 6{£, #^{C;fctt&j£^/\ 0 y^£jM3 4 Sr^ffc-fS, 
[0 10 5] 

sash^^p3 6«, mziz&^TmhfzT'ji^v x2±zmm^Ts\sKmm.zn 

3S£«Tfg£ bT^£„ SASH^gP 3 6 it. ny? >( ?y7.$kfeW>Z 8tf. CONF, 

sconf, rsconf feny 7 j ^y Z> ttfr* g Z> £ o izmmx* 

[0 10 6] 

*:ny - • h;i/^^P4 6 it. ^^fcit^my - • df-y- 
<D*j<Dtf&£ 0 loit ?T°iZ$fM-£tlTT-*<<-X 3 2 IC&^3*1,£* 

it. ^034 Oit. *3iV-<D$-(zfZMffiir2> 0 *xU-#&f$t3P4 OlC*fL 

it. ^i'j-tt ^^yM^c3 o izfeffiztifcsksmm.izft Lt^x'j -£$a;g 



4 1 



2003-3037014 



#2002—3 68 276 



^W^ilA^yf • ^-^fc^tf^tUCfl^SCONFy ^ h ^-Kf-fX^ 3 
2^e>n?ffiU ?^*#-ft:3>7^>XSC0NF£J;t) ? RSC0NF£:, 

o 

[0 10 7] 
[0 10 8] 

IC«, ^SP4 0tt, SASHt-*-^ 3 0 fcn^fflu K=ap3L^> h-*-?- K • * 

3 o w©?xy -©^^ic^M^tiT, 5t;<z)^xy - tcj; ytfcassn 

-r>*&5£gB3 8Ati^tlt, AyfMStlS. £ ft b 
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ZV>%kJiv?-&mB3 2^nm-MZftZo &mZtlt~Ay?-te. *<D&Z^7s* 
[0 10 9] 

GUI-^-^fftfiM 2 fct, y - • f 1 -* £n>ti°rL-# 1 

0Kigt!g&&3;txfc^;*:7l/>f§g* ^^^*J, 7^ *V U>f • * 

[0 110] 

A-M I I. ftm*2Zfe't'&t=-to<nMm<Di'i-V* 

-Mi, tfF^>ht^-y- K • f-^t&7xf')'7 , S4 0T?8&*&#, 
V &o r;i/^UXAH >y ^ S 4 4 t'±M UfcLSI^i fettC 

r;- J? . ^©SASH&fcgg'tS. 01 2 iCfcV^i* U7c7;i/=T U XA&7f 
[0111] 

2*iT, 014 (a) tC^«fc-5JC#K^3.^> htf) 1 ofiD;1yf?:^tS„ 
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«fc £f RSCONF-fflt 3c ff^C L T HI 1 4 (b) £#;t, ^©Ife^y^Sr-etl^ORSCONFffl 

^^fc®i4 (c) tzm-tmMt. bruits. 

[0 112] 

2$:#Ibt, i/t'JtA©7;i/^'JXA{i, X^ryzf S 5 O^iiii 
£-SASHU"<;MC£^T |8 =0.4J^T#) > 7 -r f^^cDft/^y 

&<D-t'<TlzMLTrtv?-<D&m$:M1R~rZ>o *<D&* RSCONFffitf a =0.15J^_t 

S46-S5 2{C5g^f^^-^#jt5:> 01 5 iC^fo 
[0113] 

*fiSi-S^r-C- Kt*CXf>y^S 5 6&C&VNT, 01 7 {C^$tl-2><te>^^ 
77i^i:lT, >f^7&UZ$-x.btl& a Ml 7Tlii/i-U^-A{CLfe^oT^ 
^$tlfe^^X^-CD^^7CDSP7> (r=0.2) #5t£*XT;i3U, -?-tf>|gUC&, CO 

^7^^-0t^t>y h&Mlt^s (^^;i/(Z)jiH#tctt, fc>-ffr&jiv^«r 

TO# <Z) f B*<Z> ftk ^ ffl # > £ J; tfffc © il #> & X <D IB* 5: ^ ^ T* ^ § . 
[0114] 

i) m(D^^=L^y W*.fc£, >w-*-u ^;i/££teTF-IDFM<?*tftt£^ 
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(UiM#JU:«200~300) £>K? . -fe y h £ J: 5 tC, COV* fcttLSI^5cS9 

i i i) k-*5E#j&*xy-&»M**SASH«arrs. Stfc, 0^t^h<z)t 
#g©SASH^/<;i/i: u ^>#A&-tr>:/;bR t = S t U s t+1 U ... U s h £-r 

£ S Q tt, SASHl//<;i/tf)«<£§P£1-S 0 ) 

i v) O^t^hcDt-^TtCoVNTv (= S t T*S#MSRtC-3 V%T, m = *0>)©g^ 
Kov'*Ti£«tt&m-*e#j&y;* h (m-A°y^) NN(R t , v, i)SlffiU « 

v i) ?kjtnmtfiftt>tit~m&izi*. b&^ti^nw^v- • 

U fcfn.- hT*&£=Sf->7- K©-fe V h 

v i i) #jg&rL-1f • >f>^7x>f^&^Mbti-ti ! A^7^^>^t53 
[0115]

R»4^f*^S:01 8Kl^t. *fe> ^ y #B<D&«&C£V^£j&£ *l5-r 
-*#3i£:, 01.9feJ:Df02O^t. Ell 8{C^$4x£ <k e> JC, S/^-ytfB 
<Z)yn-feXtt. X^v^S 6 0T?SASH#3t&£fftU Xr-y 6 2 Ai:I^T 

y, dti^sria^j^^^y^fa^^f&iw-rs. ^©n&sASH^jcfe^sy - k&, 

^f«y^S 6 4 lC£V>TSASH#^£^MLT*:ny-fCM3*LT&3if^5o X^r 






y#A ■ -y-yzf KizttLT RsemMZ 3-- ->f A Jj? 3i V -t;o^t)g< 0 

6 8 lC;fe^T-£;L£ 0 ^f^^S 6 4~^f7^S 6 8 {CfcVxTf#?> tlfr- Z 
#Hft#, ®2 0 (C^£*lTV^ 0 
[0116]

2 - i ) i/i- U ;*A<Z)#JlK & i — i i i * T'Wi U jg-Tc 

2-i i) 3.-*?lZttl,T?X-V-mm<l&£X$*-tfy h • # # -tf>1*--f 

xkzm&-tz> (T-zmmizftLTiti&mT'itft^') „' 

2 - i i i ) t a = max {t | k / 2* ^ a} £ <fct>*t b = min {t | k / 2 1 ^ b} £ 
tflt^. t<T(Z)t b ^t^t a (;f LT, NN(R t , q, m)£ftJri-£ 0 m% m = 
*(b)-e$>-5 0 v <= NN(R t , q, m)©-r^TtC*fLT, NN(R t , v, m)?kt\%1rZ> 

o 

2-iv) 1-^T<m b ^t^t a {C*f LT, R t lCoV^T(Dq(CM-r>g>RSCM^II<Z)^T' 
fe<g>k(q,t)$:Bm-ro 

2-v) t^Tfflt b ^tgt a i:jitlT, ^X'J-- ^7X^-NN(R t , q, k(q,t)) 
-*7- Ftf>-fey h££fi£-r£ 0 ^(7)#JlglCo^TttHI 1 4 lC;i3V^l&HJ!b£iiU 
2 - v i ) %hfrtz.>7 ! 7 7.%-mMv K -Ztl^W^ X -?-tl^©Mj^^-5m 

[0117]
[§H»J] 

* ikm t z> fc #> \z , © £ u £ 2 o © -r v it t l x mm. 

A. Times- n.-* • F^fn.* > h (Z)^- # s<- * Kjgjg -5 Z\ £ IZ J: 
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L£„ T-Z^-Xl*. M=127, 742 K^n.* > 3*1 £ e>N=65903f 

-y- K (T h U tfzi- h) £T h U tfn- N • -fey NilbtiKbfe. 
J: tftflffitt % 1" -5 1=. #> tc, f 2 - 3? * - x ic*f b X^fcnm £ frfc V 
^7cB!IW(C0V)«:ffJ&v^i£©2o<©#i*S:ilMLfe. TIE© 

(a) TD-IDFMMS^#lT$:6590M©r h U If a- MC*f LT?To 

(b) 1 0<Z>Hlfc1Zy MCfcV^T, C0V^CtcBJ« (6590 £ 200^) -ffi 

(c) K^ra* y h<DWL75.&&<DtiimiZ'D\,\X. ¥7 h • -fe y > V<DSk 
SH (a;-KCD^^;<*/f-fP=4, ^f-y - Ktf)3r-W\°£/^-f c=16) £^ffiL£ D 

(d) y h u if a- h • h^/CDflfcia^ottsSKiov^T*), f=7*;i/h--fe 

y^-f >^<Z)SASH (£U - Ktf)3r^/\°S/^ p=4, - Ftf>3f-V 7^>f -f c=16 

) zmmhtc. 

[0118]
[0119]

(a) [label illegible]: a = 25, b = 120, φ(k) = min{2k, 150}

(b) K*rz.*> h<Z>ft^/&&${tlCoV%Ttt, ifi{«©fif^tcl2#£-^;t£r 

[0120]
a' = 1.25 
= 1.25 <t> (b) 

(c) *^;**-<Z>*B*f1z/I/-7-:2>:7.f 7 s >XJC*f'1-<&*:*:U£v^&, a=0 
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A5t L£ 0 

ft*©U£v^£ 0=0 At: Vfc 0 

(e) QCRtf^yiz&H&mftt^xz-nmo^yyj t^xcdjr/m^vmb 

(f ) QCR^^^(^2o©|ffi#^^X^-(7)K©^^-;i/{C^3^'§>M^*^:©L 
[0121] 

ft-JlT^U Java (3£^|&fQ (JDK1.3) fc^M LTfBi& U fffpN- 

KtfxyfcJu Windows (gftj&g!) 2000§r^^-&felGHzcD^O-fe y-tf-j&gfc J;£>*5 
12Mb©^>f >^l^U Sr^iSLfe IBMttSS© Intel liStat ion E Pro£U£o 

[0122]

2-1. ^^f(SfeJ:tJ f iB«3X h 

©klCo^T, SC0NF(NN(R, q, k) ) <Dj&<Z>7°U 7 T <i JV & W^{ClfW1" S ^ t 
[0123]

• y?7*nm-tz&T*(DtimzmLt~mf%m<DRmT'$>z>o 
-ikt3&&7* : 77wmo)mMw&^7> hit. t^xcom^m^y^^-tvizsm 

£ ft £ & © i: 5£ b T v ^ „ 

[0124] 
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Table 1 



mm<Dtz6b(D^X h (Mb) -X.5tW\M(DW,'£i \ 


K=^a^> KCDSASHteSfl 


30.1 


KCDSASHteSfl 


1.6 




161.6 


l&SBmtfETiFill P1M l£XW&M Stilt kift ^M^KM 


204.4 




5.3 




403 



[0125]

[*2] 



Table 2

                                       [header illegible]   COV dimension reduction
Document SASH construction time (s)    460.7                898.8
Keyword SASH construction time (s)     -                    26.6
[label illegible] (s)                  7,742.9              13,854.6
[label illegible] (s)                  126.2                81.8
Total time (s)                         8,329.8              14,861.8
Total time (hr)                        2.3                  4.1




[0126] 
2-2. je«ttfc*iE«J&tf-J* 

i'J-Co^t, nm<DF*=L*y h&^tety h^e>a£<K«r*"-ft5£«j^U 
[0127]
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Table 3 



[header illegible]                        No dimension reduction   COV dimension reduction
SASH query [label illegible]              3,039.03                 2,714.57
SASH query time (ms)                      38.85                    70.74
Average SASH query accuracy (%)           62.93                    94.27
Exact NN query [label illegible]          127,742
Exact NN query time (ms)                  1,139.19                 2,732.5
[label illegible] (x 10^[illegible])      4.59                     4.07
[label illegible] (s)                     5.87                     10.68



[0128]

2-3. *xy-jc,fc.5£#0>#^;**— ffc 

[0129]

5/ y *aic £ t &7cll!ftt<z>£Mic £ y £ s ft * ^ * £ tib* 

[0130]

[*4] 



Table 4

[header illegible]   [header illegible]   COV dimension reduction
6400 - 30720         1                    1
3200 - 15360         1                    2
1600 - 7680          8                    8
800 - 3840           15                   25
400 - 1920           32                   50
200 - 960            70                   84
100 - 480            206                  135
50 - 240             405                  216
25 - 120             760                  356



[0131] 
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*mwomy^=L-* - ^ny^Aii c-g-fg, c++nm. Java (&mmm) 

yUv¥- -T^Tst. m^~-f. A-Ffa^, CD-ROM, DV 

[0133] 

e>^^^^-«, *feffw$4x-5 0 z.frh(nmm-£. feteommzis^T^mm 

[0134] 

a 

[0135]
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[0 13 6] 

RSCONFfflJS j:t>V\ 0 >y^ • 7°D7 7-fMt * 9 X # - £ R£ U itftStf^y 
[0137]

[0138]
3fc#6 : X^r-^tf U 

©^9^^-ffc©fcAtCPatchCluster2g{C^Si:S4lS?Ra£W*Nrratt, 0(|S| 
log 2 I S I + c 2 )£:&£ (0l±, Jt«T?&S. ) . ZLZ1T% cli, ^$4lS^^^ 
#-<Z)$:T*&£ OHM ft tc 8:, | S | J: y %^&»J/Jn$^) . $felcffi^f S # -J* 
it. 7"n77'f;i/MU RSCONFMlCLfc#oT*X>J - • ?^**-tf)^1f 

-£stT% 0(|S| + c log 2 |S| + c 2 )<7)^r^T'H^ft--5^i:^T' 
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t5 = 

[0139]

v ^ & aj -r £ #> iz it %± iz iEm & m i&m& u * h & &m hi*l&^ 0 mm iz & 
i&&mzsEmtem&®&v * h & iz ^m-t^t^^iz sash t ^^t^mmnmoy 
-^zmm-tzztx*. z*>izzi*h$h%:tf$LVz> 0 covikKmmzmmhfeL.k 

.Times- j.-XfH*©^-^ • "fe^ htCoVxTli, SASHli, V->T > 

*:tffim©fe*®^aa^^A#li. 0( | S | log 2 |S|)T*^>L<D*l3SASHfll3S<Z)£ 
Mc J; U3t@B£*l5 0 
[0140]

tf «J rffflm2tif~mffi<nBnizttLT^MT°&& z txmMZti. Ife^ot 
[Mtf)ff9#&t&B,l!] 

[01] 01 ##§BJ!tf>^-#«}§&«i^s£&<D7a-^- hZiskL 

[02] SASH«^$:«^^^>^i?)CD^CD«EfBS6U^^n-^-V- h 0 

[03] Av^mmz^tiSksmzkoifflmm* 

[04] *!§0^{Cj:^A^^«it©M^^^miS&0= 

[05] n > ? ^ > * mmcoKF<D$m<Dttmm & ^ l £ m* 

[0 6] SCONF©ftJ|(A£&<DM^lft&»{:=i- F£^L£0 O 

[0 7] Ay*&£t$1z)l7-ny7>(Tyx<7)WMmm.$:*kLfrmo 

[0 8] A'*/?- • ^n^T-r^^Jgif^^f e>£&<AM^lft&M=i- Ffc^b 

fe0o 

[09], PatchC luster j£ (A° y ^ ^ > * tftffe J; fcl^fr) tf>M^#J&Jl 
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mi 2] 9^x*-£*nb<Dmmm&<Dy^y%&&?zt=.tb<D#! l m<D7 

013] 01 2 lZmLfris1-V*h(DMmicm&Lf^-*Mm*mBLt=. 

a 

01 4] 01 2 {C^Ufe>>^';^AO^aiCMLfcT : -^^?r0^Lfe 

o 

01 5] 01 2C^U^^>^-';?f-A<D«t ! S{C5l3SbfeT r -^^5:0^Lfe 

o 

016] ^7^^-©W#©y775:gL/fcgI 0 

01 7] ^^x^-(Z)tBB5B5^©M^^^^^^^$:^L^0 

01 8] ?^*$-£^ftb<DmMM&<Dy^y*&!&'tz>r=.&>(D#im<D'7 

01 9] 01 8lC^Lfe^TMJ^B©^lC08^L£^-##£^£0^L£ 

o 

020] 0i 8izBLt~*s+v *B<DMmizmmLt~T- $m&*mmi>fc 



-fuyy^jv^^kLf^mo 



1 0 • 




1 2 • 




1 4 • 
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1 6 • 


-TUX 


1 8 • 




2 0 • 




2 2 • 
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S12 



S14 
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ft® (CONF) 



S18 







ig^ (RSCONF) 






F 


GUI icfclvC*^**- 







•S20 




S22 






[02] 



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level (L)=hlc|S^ 
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CONF(C jf Cp = 2/8 =25% 
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[H6] 

Profile (query q; maximum patch size m): SCONF list SCONFL
{Let QNL be the m-patch precomputed for query q.}
{Let NNL be a list of the m-patches precomputed for every element of QNL.}
{Initially, v.count = ∅ is assumed for every element v in the data set.}

1.  score ← 0;
    {Initially, no query neighbors are in the current patch.}
    for i = 1 to m do
2.      QNL[i].count ← 0;
    end for

    for i = 1 to m do
        {Retrieve the number of times QNL[i] has been encountered as an external neighbor so far.}
3.      score ← score + QNL[i].count;
        {Indicate that henceforth QNL[i] is in the current i-patch.}
4.      QNL[i].count ← present;
        for j = 1 to i - 1 do
5.          w ← NNL[j, i];
            if w.count = present then
6.              score ← score + 1;
            else if w.count ≥ 0 then
7.              w.count ← w.count + 1;
            end if
8.          w ← NNL[i, j];
            if w.count = present then
9.              score ← score + 1;
            else if w.count ≥ 0 then
10.             w.count ← w.count + 1;
            end if
        end for
11.     w ← NNL[i, i];
        if w.count = present then
12.         score ← score + 1;
        else if w.count ≥ 0 then
13.         w.count ← w.count + 1;
        end if
14.     SCONFL[i] ← score / i^2;
    end for

    {Reset the counts to their default value.}
    for i = 1 to m do
15.     QNL[i].count ← ∅;
    end for
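
The Profile procedure above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names `sconf_profile` and `sconf_direct`, the dictionary used in place of per-element `count` fields, and the `PRESENT` sentinel are all assumptions made for the example. It also assumes that the value stored for each prefix length i is the number of shared neighbors divided by i squared, which is how step 14 of the figure reads. The second function recomputes each value directly from that definition so the incremental result can be checked.

```python
from typing import Hashable, List, Sequence

PRESENT = -1  # hypothetical sentinel: element already belongs to the current patch


def sconf_profile(qnl: Sequence[Hashable],
                  nnl: Sequence[Sequence[Hashable]]) -> List[float]:
    """Incremental profile: entry i-1 is the score of the i-patch of the query.

    qnl    -- the m-patch of the query (m distinct items, nearest first)
    nnl[j] -- the m-patch precomputed for qnl[j] (distinct items, nearest first)
    """
    m = len(qnl)
    # Only members of the query patch carry a counter; every other element
    # is treated as "absent" and ignored, so no reset pass is needed here.
    count = {v: 0 for v in qnl}
    score = 0
    profile = []
    for i in range(m):
        # Hits on qnl[i] recorded while it was still outside the patch.
        score += count[qnl[i]]
        count[qnl[i]] = PRESENT
        # Entries that become visible when the patch grows to size i + 1:
        # column i of the earlier rows, then row i up to and including column i.
        new_entries = [nnl[j][i] for j in range(i)]
        new_entries += [nnl[i][j] for j in range(i + 1)]
        for w in new_entries:
            c = count.get(w)
            if c is None:
                continue          # w is not in the query patch at all
            if c == PRESENT:
                score += 1        # shared neighbor inside the current patch
            else:
                count[w] = c + 1  # remember the hit for when w joins the patch
        profile.append(score / (i + 1) ** 2)
    return profile


def sconf_direct(qnl, nnl, k):
    """Same quantity for a single patch size k, computed from the definition."""
    patch = set(qnl[:k])
    return sum(len(patch & set(nnl[j][:k])) for j in range(k)) / k ** 2
```

The incremental version touches each entry of the m-by-m neighbor table once, whereas calling `sconf_direct` for every k repeats the set intersections for each prefix.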
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[0 7] 





SASH

i = n-1      0      NN(R_0, q(n-1), m)
i = n-2      0
i = n-3      0
  ...
i = n/2-1    1      NN(R_1, q(n/2-1), m)
  ...
i = n/4-1    2      NN(R_2, q(n/4-1), m)
  ...
i = 0        h      NN(R_h, q(0), m)
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[0 8] 

RefineProfile (query q;
               inner patch size k_i;
               outer patch size k_o): reordered query k_i-patch RQNL
{Let QNL be the k_o-patch precomputed for query q.}
{Let NNL be a list of the k_o-patches precomputed for every element of QNL.}
{Initially, v.inpatch = false is assumed for every element v in the data set.}

    {Identify the inner patch members.}
    for i = 1 to k_i do
1.      QNL[i].inpatch ← true;
    end for

    {Initialize the confidence value CONFc of every patch element to zero.}
    for i = 1 to k_o do
2.      CONFc[i] ← 0;
    end for

    {For each element of the outer patch, count the number of elements
     of their k-nearest-neighbor sets shared with that of q.}
    for i = 1 to k_o do
        for j = 1 to k_i do
3.          w ← NNL[i, j];
            if w.inpatch = true then
4.              CONFc[i] ← CONFc[i] + 1;
            end if
        end for
5.      CONFc[i] ← CONFc[i] / k_o;
    end for

    {Reorder the outer patch elements according to their confidence values, from highest to lowest.}
6.  RQNL ← sort(QNL, CONFc, k_o);

    {Reset the patch membership indicators to their default values.}
    for i = 1 to k_i do
7.      QNL[i].inpatch ← false;
    end for
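
A minimal Python sketch of the RefineProfile procedure above, under stated assumptions: the name `refine_profile` and the use of a set for the `inpatch` flags are illustrative choices, the normalisation by the outer patch size follows step 5 as it reads in the figure, and ties are broken by keeping the original neighbor order, which the figure does not specify.

```python
from typing import Hashable, List, Sequence


def refine_profile(qnl: Sequence[Hashable],
                   nnl: Sequence[Sequence[Hashable]],
                   k_inner: int,
                   k_outer: int) -> List[Hashable]:
    """Reorder the outer patch of a query by overlap with its inner patch.

    qnl    -- the outer patch of the query (at least k_outer items, nearest first)
    nnl[i] -- the precomputed patch of qnl[i] (at least k_inner items)
    """
    # Steps 1 and 7 of the figure (set and clear inpatch flags) collapse
    # into building a local set of inner-patch members.
    inner = set(qnl[:k_inner])

    conf = []
    for i in range(k_outer):
        # Steps 3 and 4: count neighbors of qnl[i] that lie in the inner patch.
        shared = sum(1 for w in nnl[i][:k_inner] if w in inner)
        # Step 5: normalise by the outer patch size, as printed in the figure.
        conf.append(shared / k_outer)

    # Step 6: highest confidence first; Python's sort is stable, so equal
    # values keep their original nearest-first order (an assumption).
    order = sorted(range(k_outer), key=lambda i: -conf[i])
    return [qnl[i] for i in order]
```

For example, with an inner patch of size 2 and an outer patch of size 4, an outer element whose two nearest neighbors both fall in the inner patch is moved ahead of one that shares a single neighbor.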
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[09] 

PatchCluster (data set S;
              RSCM parameters a, b, m = φ(b);
              thresholds α, β, γ, δ): query cluster graph G

1. Randomly partition the set S into subsets S_t of approximate size [illegible] for 0 ≤ t < h = ⌊log2 |S|⌋.

2. For all 0 ≤ t < h do:

   (a) For every element v ∈ S_t, compute nearest-neighbor patches NN(R_t, v, m), where R_t = S_t ∪ S_{t+1} ∪ ... ∪ S_h.

   (b) For each element v_{t,i} ∈ S_t, compute the optimal query cluster size k(v_{t,i}) maximizing RSCONF(NN(R_t, v_{t,i}, k), φ), for values of k between a and b.

       The ranked collection of patches

           C_t = (C_{t,i} | i < j ⇒ RSCONF(C_{t,i}) ≥ RSCONF(C_{t,j}))

       form the candidates for the query clusters associated with sample R_t ⊆ S, where C_{t,i} = NN(R_t, v_{t,i}, k(v_{t,i})) and C_{t,j} = NN(R_t, v_{t,j}, k(v_{t,j})).

   (c) Let Q_t be a list of patches of [illegible] that have been confirmed as query clusters of R_t. Initially, Q_t is empty.

   (d) For all 1 ≤ i ≤ |C_t| do:

       i.  If RSCONF(C_{t,i}, φ) < α, then break from the loop.

       ii. For all w ∈ C_{t,i} do: if NN(R_t, w, k) ∉ Q_t for any value of k, or failing that, if max{CONF(NN(R_t, w, k), C_{t,i}), CONF(C_{t,i}, NN(R_t, w, k))} < β, then add C_{t,i} to the list Q_t.

3. Let h' be the largest index for which |Q_{h'}| > 0. Let {C_{t,j}} be the set of patches comprising Q_t, where C_{t,j} = NN(R_t, q_{t,j}, k(q_{t,j})) for all 0 ≤ t ≤ h'. Initialize the node set of the query cluster graph G to contain these patches, one patch per node.

4. For all δ ≤ t ≤ h', all 1 ≤ j ≤ |Q_t|, and all max{0, t - δ} ≤ s < t, do:

   (a) Compute C'_{t,j} = NN(R_s, q_{t,j}, 2^{t-s} k(q_{t,j})).

   (b) For all 1 ≤ i ≤ |Q_s|, if C_{s,i} [condition illegible] and max{CONF(C_{s,i}, C'_{t,j}), CONF(C'_{t,j}, C_{s,i})} > γ, then introduce the edges (C_{s,i}, C_{t,j}) and (C_{t,j}, C_{s,i}) into G, with weights CONF(C_{s,i}, C'_{t,j}) and CONF(C'_{t,j}, C_{s,i}), respectively.
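
The edge test in step 4(b) can be illustrated with a short Python sketch. Two points are assumptions rather than statements from the source: CONF(A, B) is taken to be the fraction of A's members that also belong to B, which is consistent with the "2/8 = 25%" example in the drawings, and the comparison with the threshold is written as a strict inequality because that is how it appears in the extracted figure text. The names `conf` and `cluster_edges` are hypothetical.

```python
from typing import Hashable, Iterable, Optional, Tuple


def conf(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Assumed definition: share of the members of patch a that also lie in patch b."""
    a_set, b_set = set(a), set(b)
    if not a_set:
        raise ValueError("patch a must not be empty")
    return len(a_set & b_set) / len(a_set)


def cluster_edges(c_s: Iterable[Hashable],
                  c_t_expanded: Iterable[Hashable],
                  gamma: float) -> Optional[Tuple[float, float]]:
    """Return the weights of the two directed edges, or None if no edge is added.

    c_s          -- a confirmed query cluster at the larger sample level s
    c_t_expanded -- the level-t cluster re-expressed as a patch at level s
    gamma        -- association threshold
    """
    forward = conf(c_s, c_t_expanded)
    backward = conf(c_t_expanded, c_s)
    if max(forward, backward) > gamma:
        return forward, backward
    return None
```

Because the two weights are generally unequal, the pair of directed edges records how strongly each cluster is contained in the other, not just that they overlap.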
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